How to Drop Duplicates Based on Specific Columns in Pandas
Answer
Use df.drop_duplicates(subset=['col1', 'col2']) to remove duplicate rows based on specific columns only. By default it keeps the first occurrence; use keep='last' or keep=False to change this behavior.
Why This Happens
Raw data often has duplicates that shouldn't exist: duplicate customer entries, repeated transactions, or multiple records from data pipeline reruns. You need to dedupe on the columns that define uniqueness (like user_id), not necessarily all columns.
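For instance, a minimal sketch showing why the subset matters: two rows can share a user_id while differing in another column, so a full-row dedupe keeps both, while deduping on user_id collapses them.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1],
    'order': [100, 200]  # same user, different orders
})

# No subset: rows differ in 'order', so nothing is dropped
print(len(df.drop_duplicates()))                    # 2

# Subset on user_id: the two rows count as duplicates
print(len(df.drop_duplicates(subset=['user_id'])))  # 1
```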
Solution
The rule: always specify subset to control which columns define uniqueness, and check duplicates with .duplicated() before dropping to understand what you're removing.
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'name': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie'],
    'order': [100, 200, 150, 300, 300]
})

# ✅ Drop duplicates based on a specific column
df.drop_duplicates(subset=['user_id'])
# Keeps the first occurrence of each user_id

# ✅ Drop duplicates based on multiple columns
df.drop_duplicates(subset=['user_id', 'order'])

# ✅ Keep the last occurrence instead of the first
df.drop_duplicates(subset=['user_id'], keep='last')

# ✅ Drop all duplicates (keep none)
df.drop_duplicates(subset=['user_id'], keep=False)

# ✅ Check for duplicates before dropping
df.duplicated(subset=['user_id']).sum()            # count duplicate rows
df[df.duplicated(subset=['user_id'], keep=False)]  # view all duplicate rows

# ✅ Modify in place
df.drop_duplicates(subset=['user_id'], inplace=True)

Better Workflow
Zerve persists state at the cell level, so you can experiment with different keep options and see exactly which rows get dropped, without losing your original dataframe if you get it wrong.
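To make the keep options concrete, here is a runnable sketch on the sample dataframe from above, showing which rows survive under each setting (row comments reflect the positional indices that remain):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'name': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie'],
    'order': [100, 200, 150, 300, 300]
})

first = df.drop_duplicates(subset=['user_id'])               # rows 0, 2, 3
last = df.drop_duplicates(subset=['user_id'], keep='last')   # rows 1, 2, 4
none = df.drop_duplicates(subset=['user_id'], keep=False)    # row 2 only

print(first['order'].tolist())  # [100, 150, 300]
print(last['order'].tolist())   # [200, 150, 300]
print(none['name'].tolist())    # ['Bob'] - the only user_id with no duplicate
```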