How to Drop Duplicates Based on Specific Columns in Pandas
Answer
Use df.drop_duplicates(subset=['col1', 'col2']) to remove duplicate rows based on specific columns only. By default it keeps the first occurrence; use keep='last' or keep=False to change this behavior.
Why This Happens
Raw data often has duplicates that shouldn't exist: duplicate customer entries, repeated transactions, or multiple records from data pipeline reruns. You need to dedupe on the columns that define uniqueness (like user_id), not necessarily all columns.
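For instance, a minimal sketch showing why the subset matters: two rows can share a user_id while differing in another column, so a full-row dedupe keeps both, while deduping on user_id collapses them.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1],
    'order': [100, 200]  # same user, different orders
})

# No subset: rows differ in 'order', so nothing is dropped
print(len(df.drop_duplicates()))                    # 2

# Subset on user_id: the two rows count as duplicates
print(len(df.drop_duplicates(subset=['user_id'])))  # 1
```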
Solution
The rule: always specify subset to control which columns define uniqueness, and check duplicates with .duplicated() before dropping to understand what you're removing.
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'name': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie'],
    'order': [100, 200, 150, 300, 300]
})

# ✅ Drop duplicates based on a specific column
df.drop_duplicates(subset=['user_id'])
# Keeps the first occurrence of each user_id

# ✅ Drop duplicates based on multiple columns
df.drop_duplicates(subset=['user_id', 'order'])

# ✅ Keep the last occurrence instead of the first
df.drop_duplicates(subset=['user_id'], keep='last')

# ✅ Drop all duplicates (keep none)
df.drop_duplicates(subset=['user_id'], keep=False)

# ✅ Check for duplicates before dropping
df.duplicated(subset=['user_id']).sum()            # count duplicate rows
df[df.duplicated(subset=['user_id'], keep=False)]  # view all duplicate rows

# ✅ Modify in place
df.drop_duplicates(subset=['user_id'], inplace=True)

Better Workflow
Zerve persists state at the cell level, so you can experiment with different keep options and see exactly which rows get dropped, without losing your original dataframe if you get it wrong.
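To make the keep options concrete, here is a runnable sketch on the sample dataframe from above, showing which rows survive under each setting (row comments reflect the positional indices that remain):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'name': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie'],
    'order': [100, 200, 150, 300, 300]
})

first = df.drop_duplicates(subset=['user_id'])               # rows 0, 2, 3
last = df.drop_duplicates(subset=['user_id'], keep='last')   # rows 1, 2, 4
none = df.drop_duplicates(subset=['user_id'], keep=False)    # row 2 only

print(first['order'].tolist())  # [100, 150, 300]
print(last['order'].tolist())   # [200, 150, 300]
print(none['name'].tolist())    # ['Bob'] - the only user_id with no duplicate
```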