Pandas

ParserError: Error Tokenizing Data — How to Fix It

Answer

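This error means pandas could not split your CSV into a consistent set of columns: usually a row has more fields than expected, a line is malformed, or the delimiter is wrong. Fix it by setting the correct delimiter, passing on_bad_lines='skip' (pandas 1.3 or later) to drop problem rows, or using engine='python' for more flexible parsing. Keep in mind that skipping rows discards data, so check what is being dropped before relying on it.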

Why This Happens

CSVs in the wild are messy. Common causes: some rows have extra commas, quoted fields contain unescaped delimiters or an unclosed quote, the file uses a semicolon or tab instead of a comma, there's a corrupted line mid-file, or the header row doesn't match the data rows. Pandas' default C parser is fast but strict: it raises as soon as a row has more fields than it expects. Rows with too few fields do not trigger the error; they are padded with NaN.
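
To make the failure concrete, here is a minimal sketch that reproduces it from an in-memory string. The sample data and variable names are made up for illustration; the point is that the over-long row raises, while the short row is quietly padded.

```python
import io

import pandas as pd

# Hypothetical sample data: the header declares 3 columns.
csv_text = (
    "id,name,city\n"
    "1,Ana,Lisbon\n"
    "2,Ben\n"             # too few fields: padded with NaN, no error
    "3,Cy,Oslo,extra\n"   # too many fields: this row triggers ParserError
)

try:
    pd.read_csv(io.StringIO(csv_text))
    error_message = None
except pd.errors.ParserError as exc:
    # The message names the line number and the expected vs. seen field counts.
    error_message = str(exc)

print(error_message)

# Without the over-long row, the same data parses and the short row survives.
short_only = "id,name,city\n1,Ana,Lisbon\n2,Ben\n"
df_short = pd.read_csv(io.StringIO(short_only))
print(df_short)
```

The line number in the message is counted from the top of the file, header included, which makes it easy to open the file and inspect the offending row directly.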

Solution

The rule: when you hit a parser error, first check the delimiter, then try on_bad_lines='warn' to see what's actually broken.

import pandas as pd

# ❌ Problematic: default parser fails on messy CSV
df = pd.read_csv('messy_file.csv')
# ParserError: Error tokenizing data. C error: Expected 5 fields in line 47, saw 6

# ✅ Fixed: skip bad lines
df = pd.read_csv('messy_file.csv', on_bad_lines='skip')

# ✅ Fixed: use python engine (slower but more forgiving)
df = pd.read_csv('messy_file.csv', engine='python', on_bad_lines='skip')

# ✅ Fixed: specify correct delimiter if not comma
df = pd.read_csv('messy_file.csv', delimiter=';')

# ✅ Debug: find the bad lines first
df = pd.read_csv('messy_file.csv', on_bad_lines='warn')  # skips bad lines and emits a warning for each one
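
The "check the delimiter, then see what's broken" rule can also be sketched as one helper. This is an illustration under assumptions, not a recommended API: read_messy_csv, its candidates parameter, and the sample string are hypothetical, and the delimiter guess is a simple heuristic (the candidate that appears most often in the header line). Passing a callable to on_bad_lines requires engine='python' and pandas 1.4 or later.

```python
import io

import pandas as pd


def read_messy_csv(text, candidates=",;\t|"):
    """Hypothetical helper: guess the delimiter, then collect malformed rows."""
    # Heuristic guess: pick the candidate that occurs most often in the header.
    # This can be wrong (e.g. commas inside quoted header names), so verify it.
    header = text.splitlines()[0]
    delimiter = max(candidates, key=header.count)

    bad_rows = []

    def remember(fields):
        # pandas calls this with the already-split fields of each over-long row.
        bad_rows.append(fields)
        return None  # returning None drops the row from the result

    df = pd.read_csv(
        io.StringIO(text),
        sep=delimiter,
        engine="python",       # callables are only supported by this engine
        on_bad_lines=remember,
    )
    return df, delimiter, bad_rows


# Made-up semicolon-delimited sample with one over-long row.
sample = "id;name;city\n1;Ana;Lisbon\n2;Ben;Oslo;extra\n3;Cy;Rome\n"
df, delimiter, bad_rows = read_messy_csv(sample)
print(delimiter)
print(bad_rows)
print(df)
```

Collecting the rejected rows instead of only skipping them lets you decide afterwards whether to repair them, log them, or fix the export that produced them.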

Better Workflow

Zerve lets you iterate on parsing logic quickly: run a cell, see what breaks, adjust parameters, and re-run without restarting your whole environment. The result is a faster feedback loop for wrangling messy files.