Skip to content

6.4 — Beware of Duplicates When Joining

What You'll Learn

  • Why joining tables can create duplicate rows
  • How to diagnose and prevent unwanted duplicates
  • Best practices for safe joins

Sample Tables

Table sales:

SaleIDProductIDAmount
1101120
210280
310150
410390

Table products:

ProductIDProductName
101Apples
101Apples
102Oranges
103Bananas

Why Do Duplicates Happen?

When you join tables in SQL, each row from the first table is matched with every row from the second table that meets the join condition.
If either table has duplicate values in the join column, you may get more rows than expected.

Example:

Suppose your sales table has multiple sales for the same ProductID, and your products table also has duplicate ProductID values (perhaps due to data entry errors):

sql
SELECT s.SaleID, p.ProductID, p.ProductName, s.Amount
FROM sales s
INNER JOIN products p
  ON s.ProductID = p.ProductID;
  • If products has two rows for the same ProductID, each sale for that product will appear twice in the result.

How to Diagnose Duplicates

  • Check for duplicates in your join columns before joining:
sql
SELECT ProductID, COUNT(*)
FROM products
GROUP BY ProductID
HAVING COUNT(*) > 1;
  • If this returns any rows, you have duplicate keys in products.

How to Prevent or Handle Duplicates

  • Ask to fix the data quality issue by deduplicating rows in the source tables before joining.
  • Ensure your join columns are unique identifiers (primary keys) in at least one table.
  • Option: Use SELECT DISTINCT after your join to remove duplicate results (but this only hides underlying data issues).

Key Points

  • Duplicates in join columns can multiply your result rows unexpectedly.
  • Always check your data for duplicates before joining.
  • Use unique keys for reliable joins.

Next Steps

Now that you understand the risks of duplicates, you’re ready to confidently join tables and build robust SQL analyses.