Proactive Data Pipeline Maintenance via Machine Learning-Driven Anomaly Detection

greg

Last Updated 1 day ago

About

This canvas replicates the workflow from the paper by Akash Vijayrao Chaudhari and Pallavi Ashokrao Charate. It builds a synthetic data pipeline throughput dataset with injected anomalies, including spikes, drops, and schema drift. Isolation Forest serves as the primary anomaly detection model, and its contamination parameter is tuned via grid search to optimize detection performance (accuracy, precision, recall, and F1).
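The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the canvas's or the paper's actual code: the series length, baseline throughput, anomaly magnitudes, drift positions, feature choice, and contamination grid are all assumptions made for the example, and F1 is used as the single selection metric for brevity.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
n = 1000  # assumed series length

# Baseline throughput (hypothetical units) with Gaussian noise
throughput = rng.normal(loc=1000.0, scale=50.0, size=n)
schema_version = np.ones(n, dtype=int)
labels = np.zeros(n, dtype=int)  # 1 = injected anomaly

# Inject spikes and drops at disjoint random positions (assumed magnitudes)
spike_idx = rng.choice(n, size=20, replace=False)
remaining = np.setdiff1d(np.arange(n), spike_idx)
drop_idx = rng.choice(remaining, size=20, replace=False)
throughput[spike_idx] *= 3.0
throughput[drop_idx] *= 0.2
labels[spike_idx] = 1
labels[drop_idx] = 1

# Inject schema drift: the version increments at assumed positions
for p in (400, 750):
    schema_version[p:] += 1
    labels[p] = 1

# Features: raw throughput plus a flag marking a schema version change
schema_change = np.diff(schema_version, prepend=schema_version[0]) != 0
X = np.column_stack([throughput, schema_change.astype(float)])


def grid_search_contamination(X, y, grid):
    """Fit one Isolation Forest per contamination value and score it by F1."""
    results = {}
    for c in grid:
        model = IsolationForest(contamination=c, random_state=0)
        # fit_predict returns -1 for anomalies and 1 for normal points
        pred = (model.fit_predict(X) == -1).astype(int)
        results[c] = f1_score(y, pred)
    best = max(results, key=results.get)
    return best, results


grid = [0.01, 0.02, 0.04, 0.06, 0.08, 0.10]
best_c, f1_by_c = grid_search_contamination(X, labels, grid)
print(f"best contamination: {best_c}, F1: {f1_by_c[best_c]:.3f}")
```

Because the labels are known only for synthetic data, this kind of supervised tuning of an unsupervised model is practical here but would need a labeled validation set in a real pipeline.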
The canvas includes detailed evaluation blocks: confusion matrix heatmaps, performance metrics, time series anomaly overlays, and visualizations of throughput and schema version changes over time. Custom thresholds are applied to adjust detection sensitivity, particularly so that schema drift events are detected.
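One way to apply a custom threshold and compute the evaluation metrics is sketched below. It is an assumption-laden illustration rather than the canvas's implementation: the helper name evaluate_with_threshold, the percentile cutoff on Isolation Forest scores, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)


def evaluate_with_threshold(X, y_true, percentile=5.0, random_state=0):
    """Flag the lowest-scoring `percentile` percent of points as anomalies."""
    model = IsolationForest(random_state=random_state).fit(X)
    scores = model.score_samples(X)  # lower score = more anomalous
    threshold = np.percentile(scores, percentile)
    y_pred = (scores <= threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return {
        "threshold": float(threshold),
        # Rows are true classes, columns are predicted classes: [normal, anomaly]
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data (assumed values): throughput plus a schema-change flag
rng = np.random.default_rng(0)
n = 500
throughput = rng.normal(1000.0, 50.0, size=n)
schema_change = np.zeros(n)
y_true = np.zeros(n, dtype=int)

spikes = np.arange(20, 500, 50)  # 10 spike positions
throughput[spikes] *= 3.0
y_true[spikes] = 1
schema_change[250] = 1.0  # a single schema drift event
y_true[250] = 1

X = np.column_stack([throughput, schema_change])
report = evaluate_with_threshold(X, y_true, percentile=3.0)
print(report["confusion_matrix"])
print({k: round(float(report[k]), 3) for k in ("accuracy", "precision", "recall", "f1")})
```

Raising the percentile flags more points, which tends to raise recall at the cost of precision. The returned confusion matrix is a plain 2x2 array, so it can be passed to a plotting library to draw a heatmap.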
Overall, the canvas reproduces the core anomaly detection and evaluation methodology described in the paper, providing a clear, extensible environment for experimentation and further research in proactive data pipeline maintenance.
