Shipped this project!
Hi!! I built a Data Cleaning Pipeline in Streamlit.
About the whole project
it comes with an AI assistant that only reads a summary of your dataset: column names, data types, null rates, duplicate counts, and up to 3 truncated sample values per column. it doesn’t read your full data directly, so apart from those few truncated samples it never sees the actual rows. it tells you what needs fixing and why, and writes the pandas code for you.
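for the curious, here’s roughly what that kind of summary could look like. this is a minimal sketch, not the app’s actual code: the function name `summarize_for_assistant`, the `max_chars` truncation length, and the shape of the returned dict are all assumptions made for illustration.

```python
import pandas as pd


def summarize_for_assistant(df: pd.DataFrame, max_samples: int = 3, max_chars: int = 20) -> dict:
    """Build a compact, metadata-only summary of a DataFrame (hypothetical helper)."""
    columns = {}
    for name in df.columns:
        col = df[name]
        # Take at most `max_samples` distinct non-null values, each cut to `max_chars` characters.
        samples = [str(v)[:max_chars] for v in col.dropna().unique()[:max_samples]]
        columns[str(name)] = {
            "dtype": str(col.dtype),
            "null_rate": round(float(col.isna().mean()), 4),
            "samples": samples,
        }
    return {
        "n_rows": int(len(df)),
        # Rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        "columns": columns,
    }


df = pd.DataFrame(
    {
        "email": ["a@x.com", None, "a@x.com", "averyveryverylongaddress@example.com"],
        "age": [30, 41, 30, 25],
    }
)
summary = summarize_for_assistant(df)
```

only `summary` would go to the assistant in this sketch, never `df` itself. the sample values are still real values from your data though, which is why they’re capped and truncated.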
beyond that: cleaning functions, custom validators for phone/email/date formats, downloadable PDF report, correlation heatmap, data type guesser, live quality score, distribution charts, session history with undo/redo, multi-filter stacking with AND/OR logic, CSV and Excel export, and a whole lot more. tested on datasets of 100K+ rows.
also! please be kind about performance. Streamlit reruns the whole script on literally every button click, which makes optimization hard.
since the last version, here’s what changed:
- API key moved from URL param to request header so it no longer leaks in tracebacks or network tabs
- pipeline JSON imports now validated with an allowlist so crafted inputs can’t reach cleaning functions
- find and replace now takes a full table of pairs, all applied against the original values so they can’t cascade into each other
- each row shows a live match count as you type, and a toast after each run shows exactly what changed per pair
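to show what "can’t cascade" means for find and replace, here’s a small sketch. it assumes whole-cell matching on a single pandas column and that the first pair wins when two pairs share the same find value. the name `apply_pairs` is made up for this example, and the real app may well match substrings or handle conflicts differently.

```python
import pandas as pd


def apply_pairs(series: pd.Series, pairs: list[tuple[str, str]]) -> tuple[pd.Series, dict[str, int]]:
    """Apply (find, replace) pairs against the ORIGINAL values only (hypothetical helper)."""
    original = series.copy()
    # Match counts come from the untouched column, like a live preview would show.
    counts = {find: int((original == find).sum()) for find, _ in pairs}
    mapping: dict[str, str] = {}
    for find, replace in pairs:
        mapping.setdefault(find, replace)  # assumption: first pair wins on duplicate find values
    # One lookup per cell, so the output of one pair is never fed into another pair.
    result = original.map(lambda v: mapping.get(v, v))
    return result, counts


s = pd.Series(["a", "b", "c", "a"])
result, counts = apply_pairs(s, [("a", "b"), ("b", "c")])
```

running the pairs one after another would turn every "a" into "b" and then into "c". doing a single lookup per cell against the original values means "a" stops at "b".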
test everything and tell me what you think. I’m genuinely curious which features would actually be useful to you. Thanks!!!
Note:
if something breaks mid-pipeline, just refresh and hit "resume your session", it’ll pick up right where you left off. though I’m sure I handled all the edge cases