6.4 — Beware of Duplicates When Joining
What You'll Learn
- Why joining tables can create duplicate rows
- How to diagnose and prevent unwanted duplicates
- Best practices for safe joins
Sample Tables
Table sales:
SaleID | ProductID | Amount |
---|---|---|
1 | 101 | 120 |
2 | 102 | 80 |
3 | 101 | 50 |
4 | 103 | 90 |
Table products:
ProductID | ProductName |
---|---|
101 | Apples |
101 | Apples |
102 | Oranges |
103 | Bananas |
Why Do Duplicates Happen?
When you join tables in SQL, each row from the first table is matched with every row from the second table that meets the join condition.
If either table has duplicate values in the join column, you may get more rows than expected.
Example:
Suppose your sales
table has multiple sales for the same ProductID
, and your products
table also has duplicate ProductID
values (perhaps due to data entry errors):
sql
SELECT s.SaleID, p.ProductID, p.ProductName, s.Amount
FROM sales s
INNER JOIN products p
ON s.ProductID = p.ProductID;
- If
products
has two rows for the sameProductID
, each sale for that product will appear twice in the result.
How to Diagnose Duplicates
- Check for duplicates in your join columns before joining:
sql
SELECT ProductID, COUNT(*)
FROM products
GROUP BY ProductID
HAVING COUNT(*) > 1;
- If this returns any rows, you have duplicate keys in
products
.
How to Prevent or Handle Duplicates
- Ask to fix the data quality issue by deduplicating rows in the source tables before joining.
- Ensure your join columns are unique identifiers (primary keys) in at least one table.
- Option: Use
SELECT DISTINCT
after your join to remove duplicate results (but this only hides underlying data issues).
Key Points
- Duplicates in join columns can multiply your result rows unexpectedly.
- Always check your data for duplicates before joining.
- Use unique keys for reliable joins.
Next Steps
Now that you understand the risks of duplicates, you’re ready to confidently join tables and build robust SQL analyses.