
How to Drop Duplicates Based on Specific Columns in Pandas

Answer

Use df.drop_duplicates(subset=['col1', 'col2']) to remove duplicate rows based on specific columns only. By default it keeps the first occurrence of each duplicate group; pass keep='last' to keep the last occurrence instead, or keep=False to drop every row that has a duplicate.

Why This Happens

Raw data often has duplicates that shouldn't exist: duplicate customer entries, repeated transactions, or multiple records from data pipeline reruns. You need to dedupe on the columns that define uniqueness (like user_id), not necessarily all columns.
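
To make the difference concrete, here is a minimal sketch comparing full-row deduplication with deduplication on a key column. The sample data and the variable names full_row and by_user are illustrative assumptions, not part of any real dataset.

```python
import pandas as pd

# Illustrative sample data: user 1 appears twice with different orders,
# user 3 appears twice with identical rows.
df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'name': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie'],
    'order': [100, 200, 150, 300, 300]
})

# No subset: a row counts as a duplicate only if every column matches an earlier row.
full_row = df.drop_duplicates()

# subset=['user_id']: a row counts as a duplicate if its user_id was already seen.
by_user = df.drop_duplicates(subset=['user_id'])

print(len(df), len(full_row), len(by_user))  # 5 4 3
```

Without subset, both of Alice's rows survive because their order values differ; with subset=['user_id'], only her first row is kept.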

Solution

The rule: always specify subset to control which columns define uniqueness, and check duplicates with .duplicated() before dropping to understand what you're removing.

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'name': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie'],
    'order': [100, 200, 150, 300, 300]
})

# ✅ Check for duplicates before dropping
df.duplicated(subset=['user_id']).sum()  # count rows that repeat an earlier user_id
df[df.duplicated(subset=['user_id'], keep=False)]  # view every row involved in a duplicate

# ✅ Drop duplicates based on a specific column (returns a new DataFrame)
df.drop_duplicates(subset=['user_id'])
# Keeps first occurrence of each user_id

# ✅ Drop duplicates based on multiple columns
df.drop_duplicates(subset=['user_id', 'order'])

# ✅ Keep last occurrence instead of first
df.drop_duplicates(subset=['user_id'], keep='last')

# ✅ Drop all duplicates (keep none)
df.drop_duplicates(subset=['user_id'], keep=False)

# ✅ Modify in place (changes df itself and returns None)
df.drop_duplicates(subset=['user_id'], inplace=True)
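
As a rough sketch of the "check before dropping" rule, the mask returned by .duplicated() can be used to preview exactly which rows drop_duplicates would remove. The names key, to_drop, dropped, kept, and deduped are illustrative, and the sample data is the same assumed DataFrame as above.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'name': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie'],
    'order': [100, 200, 150, 300, 300]
})

key = ['user_id']

# True for each row that drop_duplicates(subset=key, keep='first') would remove.
to_drop = df.duplicated(subset=key, keep='first')

dropped = df[to_drop]   # rows that would be removed
kept = df[~to_drop]     # rows that would survive

# The mask and drop_duplicates select the same rows, so the preview is faithful.
assert kept.equals(df.drop_duplicates(subset=key, keep='first'))

# Assigning the result to a new name leaves df untouched, unlike inplace=True.
# reset_index(drop=True) renumbers the surviving rows from 0.
deduped = df.drop_duplicates(subset=key).reset_index(drop=True)
```

Inspecting dropped before committing makes it easier to notice when the chosen subset is too coarse, for example when it would discard one of Alice's two distinct orders.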

Better Workflow

Zerve persists state at the cell level, so you can experiment with different keep options and see exactly which rows get dropped, without losing your original dataframe if you get it wrong.
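
In plain pandas, a similar side-by-side comparison can be approximated by never mutating the original DataFrame. The sketch below is independent of any Zerve feature; the survivors dictionary and the sample data are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'name': ['Alice', 'Alice', 'Bob', 'Charlie', 'Charlie'],
    'order': [100, 200, 150, 300, 300]
})

# For each keep option, record the original row labels that survive.
# drop_duplicates returns a new DataFrame each time, so df is never changed.
survivors = {
    str(keep): df.drop_duplicates(subset=['user_id'], keep=keep).index.tolist()
    for keep in ('first', 'last', False)
}

for option, labels in survivors.items():
    print(option, labels)
# first [0, 2, 3]
# last [1, 2, 4]
# False [2]
```

Because the row labels are preserved, the output shows directly which original rows each option keeps.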
