How to Read Large CSV Files in Chunks – Python Pandas
Answer
Use the chunksize parameter in pd.read_csv() to read the file in smaller pieces. This returns an iterator you can loop through, processing each chunk without loading the entire file into memory.
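A minimal sketch of the pattern (the filename and chunk size are placeholders, and process() stands in for your per-chunk logic); in recent pandas versions the returned reader also works as a context manager, which closes the file handle when you're done:

import pandas as pd

# chunksize turns read_csv into an iterator of DataFrames
with pd.read_csv('large_file.csv', chunksize=100000) as reader:
    for chunk in reader:
        process(chunk)  # placeholder for your per-chunk logic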
Why This Happens
Large CSV files can exhaust your RAM and crash your session when pandas tries to load everything at once. Chunking lets you process files larger than your available memory by working on one piece at a time: filter rows, aggregate data, or write to a database incrementally.
Solution
The rule: choose a chunksize that fits in memory (start with 100k rows, adjust based on your column count and dtypes), then process each chunk in a loop.
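One way to pick that number is to measure a small sample first. A rough sketch (the sample size and the 256 MB budget are assumptions to tune; per-row memory varies, especially with string columns):

import pandas as pd

# Measure the in-memory size of a small sample
sample = pd.read_csv('large_file.csv', nrows=10000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Target roughly 256 MB per chunk, adjusted to your RAM budget
target_bytes = 256 * 1024 ** 2
chunksize = max(1, int(target_bytes / bytes_per_row))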
import pandas as pd
# Read in chunks and process iteratively
chunks = pd.read_csv('large_file.csv', chunksize=100000)
for chunk in chunks:
    # Process each chunk (filter, transform, save)
    filtered = chunk[chunk['status'] == 'active']
    # Do something with filtered chunk
# Aggregate across chunks
total = 0
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    total += chunk['sales'].sum()
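Simple sums compose across chunks, and so do groupby aggregations if you keep the partial results and combine them at the end. A sketch assuming hypothetical 'region' and 'sales' columns:

partials = []
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    # Per-group sums for this chunk only
    partials.append(chunk.groupby('region')['sales'].sum())
# Summing the partials gives the global per-group totals
totals = pd.concat(partials).groupby(level=0).sum()

This works because sums compose; for an average you would track the running sum and count separately and divide at the end.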
# Filter and combine into final dataframe
chunks = pd.read_csv('large_file.csv', chunksize=100000)
filtered_chunks = []
for chunk in chunks:
    filtered = chunk[chunk['value'] > 100]
    filtered_chunks.append(filtered)
# Note: the combined result must still fit in memory
result = pd.concat(filtered_chunks, ignore_index=True)
# Combine with usecols for even less memory
chunks = pd.read_csv('large_file.csv',
                     chunksize=100000,
                     usecols=['col1', 'col2'])  # only load needed columns
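Passing dtype shrinks each chunk further; this sketch assumes a low-cardinality 'status' column and a numeric 'value' column (both placeholders for your own schema):

chunks = pd.read_csv('large_file.csv',
                     chunksize=100000,
                     dtype={'status': 'category',  # repeated strings stored once
                            'value': 'float32'})   # half the width of float64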
# Write chunks directly to database or file
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    chunk.to_sql('table', connection, if_exists='append')
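The same pattern works for a flat file; a sketch where 'output.csv' is a placeholder, appending each chunk and writing the header only once:

first = True
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    chunk.to_csv('output.csv', mode='a', header=first, index=False)
    first = False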
Better Workflow
Zerve runs on cloud infrastructure with Lambda, Fargate, GPU, and Kubernetes compute options. For truly large files, you can run your chunking job on beefier hardware instead of fighting your laptop's limits.