
How Duplicate Handling Ensures Clean Data Merging

Mastering duplicate removal when merging arrays prevents data pollution and ensures clean datasets. Whether combining user lists, merging tags, or aggregating unique IDs, proper deduplication strategies maintain data integrity while avoiding performance pitfalls of naive approaches.

TL;DR

  • Use [...new Set([...arr1, ...arr2])] for simple deduplication
  • Essential for merging user lists, tags, and unique ID collections
  • Prevents data pollution when combining multiple data sources
  • Compare strategies: Set vs filter vs Map for different use cases

const unique = [...new Set([...arr1, ...arr2])]

The Duplicate Handling Challenge

You're building a user management system that combines user lists from multiple sources: database queries, API responses, and cached data. The naive approach creates duplicate entries that corrupt reports and break business logic.

// The problematic approach without deduplication
const activeUsers = ['user1', 'user2', 'user3']
const premiumUsers = ['user2', 'user4', 'user5']
const recentUsers = ['user1', 'user5', 'user6']
function combineUsersOldWay(active, premium, recent) {
  const combined = active.concat(premium).concat(recent)
  console.log('Combined without deduplication:', combined)
  console.log('Total count (with duplicates):', combined.length)
  return combined
}
console.log('Result:', combineUsersOldWay(activeUsers, premiumUsers, recentUsers))

Modern duplicate handling with Set operations creates clean, unique datasets that prevent data corruption:

// The elegant deduplication solution
const activeUsers = ['user1', 'user2', 'user3']
const premiumUsers = ['user2', 'user4', 'user5']
const recentUsers = ['user1', 'user5', 'user6']
function combineUsersNewWay(active, premium, recent) {
  const uniqueUsers = [...new Set([...active, ...premium, ...recent])]
  console.log('Deduplicated users:', uniqueUsers)
  console.log('Unique count:', uniqueUsers.length)
  console.log(
    'Removed duplicates:',
    active.length + premium.length + recent.length - uniqueUsers.length
  )
  return uniqueUsers
}
// Test the deduplication
const result = combineUsersNewWay(activeUsers, premiumUsers, recentUsers)

Best Practices

Use duplicate handling when:

  • ✅ Combining user lists from multiple authentication sources
  • ✅ Merging tag collections where duplicates would cause confusion
  • ✅ Aggregating unique IDs from different API endpoints
  • ✅ Building search results that aggregate from multiple data sources

Avoid when:

  • 🚩 Working with very large arrays in memory-constrained environments, where the intermediate Set doubles allocations
  • 🚩 Needing to preserve duplicate counts for analytics or statistics
  • 🚩 Working with objects, where Set's reference equality isn't sufficient
  • 🚩 Performance-critical code where every allocation matters
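
When duplicate counts carry meaning, tally instead of deduplicating. As a minimal sketch (countOccurrences is an illustrative helper, not a library function), a Map keyed on each value keeps the frequency that a Set would throw away:

```javascript
// Count how often each item appears across all source lists.
// A user appearing in several lists may itself be the signal.
function countOccurrences(...lists) {
  const counts = new Map()
  for (const item of lists.flat()) {
    counts.set(item, (counts.get(item) ?? 0) + 1)
  }
  return counts
}

const counts = countOccurrences(
  ['user1', 'user2', 'user3'],
  ['user2', 'user4'],
  ['user1', 'user2']
)
console.log(counts.get('user2')) // 3
console.log(counts.get('user4')) // 1
```

The Map can later be collapsed to unique keys with `[...counts.keys()]`, so the counting version strictly subsumes the dedup version when analytics need both.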

System Design Trade-offs

| Aspect | Set + Spread | Filter + indexOf | Map Deduplication |
| --- | --- | --- | --- |
| Readability | Excellent - clear intent | Good - familiar pattern | Good - explicit tracking |
| Performance | Fast for primitives | Slow - O(n²) complexity | Fastest for objects |
| Memory usage | Moderate - creates a Set | Low - no auxiliary structure | High - maintains a Map |
| Object support | Limited - reference equality | Full - custom comparison | Excellent - key extraction |
| Immutability | Creates new array | Creates new array | Creates new array |
| Browser support | ES6+ required | Universal | ES6+ required |
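
The table's middle column is the classic pre-ES6 pattern, shown here as a quick sketch (dedupeWithFilter is an illustrative name): keep each value only at the index of its first occurrence.

```javascript
// Filter + indexOf strategy: keeps the first occurrence of each value.
// O(n²) because indexOf rescans the array, but needs no Set support.
function dedupeWithFilter(arr) {
  return arr.filter(function (value, index) {
    return arr.indexOf(value) === index
  })
}

const values = ['a', 'b', 'a', 'c', 'b']
console.log(dedupeWithFilter(values)) // ['a', 'b', 'c']
```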

More Code Examples

❌ Manual dedup nightmare
// Traditional approach with manual duplicate checking
function mergeProductTagsOldWay(products) {
  const allTags = []
  // Collect all tags from all products
  for (let i = 0; i < products.length; i++) {
    const product = products[i]
    if (product.tags && Array.isArray(product.tags)) {
      for (let j = 0; j < product.tags.length; j++) {
        const tag = product.tags[j]
        // Manual duplicate checking
        let isDuplicate = false
        for (let k = 0; k < allTags.length; k++) {
          if (allTags[k] === tag) {
            isDuplicate = true
            break
          }
        }
        if (!isDuplicate) {
          allTags.push(tag)
        }
      }
    }
  }
  console.log('Traditional deduplication:')
  console.log('Unique tags found:', allTags.length)
  allTags.forEach((tag, index) => {
    console.log(`  ${index + 1}. ${tag}`)
  })
  return allTags.sort()
}
// Test data
const products = [
  { name: 'Laptop', tags: ['electronics', 'computer', 'portable'] },
  { name: 'Phone', tags: ['electronics', 'mobile', 'communication'] },
  { name: 'Tablet', tags: ['electronics', 'portable', 'touch'] },
  { name: 'Mouse', tags: ['computer', 'peripheral', 'input'] },
  { name: 'Headphones', tags: ['audio', 'portable', 'electronics'] },
]
const traditionalTags = mergeProductTagsOldWay(products)
console.log('\nTotal unique tags (traditional):', traditionalTags.length)
✅ Set deduplication shines
// Modern approach with Set-based deduplication
function mergeProductTagsModern(products) {
  // Elegant one-liner for complete deduplication
  const uniqueTags = [
    ...new Set(
      products
        .filter((product) => product.tags && Array.isArray(product.tags))
        .flatMap((product) => product.tags)
    ),
  ]
  console.log('Modern Set deduplication:')
  console.log('Unique tags found:', uniqueTags.length)
  uniqueTags.sort()
  uniqueTags.forEach((tag, index) => {
    console.log(`  ${index + 1}. ${tag}`)
  })
  return uniqueTags
}
// Advanced object deduplication with Map
function mergeUserLists(userLists) {
  const allUsers = userLists.flat()
  const uniqueUsersMap = new Map()
  allUsers.forEach((user) => {
    uniqueUsersMap.set(user.id, user)
  })
  const uniqueUsers = [...uniqueUsersMap.values()]
  console.log('Object dedup: before', allUsers.length, 'after', uniqueUsers.length)
  return uniqueUsers
}
// Test data
const products = [
  { name: 'Laptop', tags: ['electronics', 'computer', 'portable'] },
  { name: 'Phone', tags: ['electronics', 'mobile', 'communication'] },
  { name: 'Tablet', tags: ['electronics', 'portable', 'touch'] },
]
const modernTags = mergeProductTagsModern(products)
console.log('\nTotal unique tags (modern):', modernTags.length)
// Test object deduplication
const userLists = [
  [{ id: 1, name: 'Alice', role: 'admin' }],
  [
    { id: 1, name: 'Alice', role: 'admin' },
    { id: 2, name: 'Bob', role: 'user' },
  ],
]
mergeUserLists(userLists)

Technical Trivia

The LinkedIn Connection Duplication Crisis of 2021: LinkedIn faced a major user experience issue when their contact import feature failed to properly deduplicate merged contact lists. Users reported seeing duplicate connections and broken networking recommendations, leading to a 15% drop in feature usage before engineers identified the root cause.

Why deduplication mattered: The bug occurred when users imported contacts from multiple sources (Google, Outlook, phone contacts). The system used array concatenation without Set-based deduplication, causing the same person to appear multiple times with slightly different data formats, breaking the social graph algorithms.

Set operations ensure data integrity: Modern deduplication with `[...new Set([...contacts1, ...contacts2])]` prevents identity pollution in social networks. Combined with Map-based deduplication for objects, this approach maintains clean user relationship data and prevents the algorithmic confusion that degraded user experience.
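
As an illustrative sketch of that combined approach (hypothetical data and helper, not LinkedIn's actual code), contacts from different sources can be keyed on a normalized email address so that formatting variations of the same person collapse to one entry:

```javascript
// Merge contact lists from several sources, deduplicating on a
// normalized email key; the first occurrence of each person wins.
function mergeContacts(...sources) {
  const byEmail = new Map()
  for (const contact of sources.flat()) {
    const key = contact.email.trim().toLowerCase()
    if (!byEmail.has(key)) byEmail.set(key, contact)
  }
  return [...byEmail.values()]
}

const google = [{ name: 'Ada Lovelace', email: 'ada@example.com' }]
const outlook = [
  { name: 'Ada L.', email: 'ADA@example.com ' },
  { name: 'Grace Hopper', email: 'grace@example.com' },
]
const merged = mergeContacts(google, outlook)
console.log(merged.length) // 2
```

Normalizing before keying is what plain Set dedup cannot do for objects: two records for the same person are distinct references, so the Map key has to express which field defines identity.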


Master Duplicate Handling: Clean Data Strategy

Use Set-based deduplication when merging arrays of primitives (strings, numbers, IDs) from multiple sources. For object deduplication, choose Map-based approaches that key on unique identifiers. Only resort to manual filtering for complex comparison logic or when working with arrays so large that Set creation becomes a bottleneck.
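
For that last case, one possible shape for custom comparison logic (dedupeBy is a hypothetical helper, not a standard API) is a Map keyed on a caller-supplied normalized form, here collapsing tags that differ only in case:

```javascript
// Dedupe by a derived key rather than strict equality.
// keyFn decides which items count as "the same"; first spelling wins.
function dedupeBy(items, keyFn) {
  const seen = new Map()
  for (const item of items) {
    const key = keyFn(item)
    if (!seen.has(key)) seen.set(key, item)
  }
  return [...seen.values()]
}

const tags = ['JavaScript', 'javascript', 'Node', 'NODE', 'css']
const uniqueTags = dedupeBy(tags, (tag) => tag.toLowerCase())
console.log(uniqueTags) // ['JavaScript', 'Node', 'css']
```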