
Data Cleaning Pipeline

9 devlogs
7h 56m 20s

I am making Data Cleaning Pipeline, where users upload a PDF, CSV, or Excel file and the app scans it, tells them what is wrong, and lets them fix it. You can let the AI handle everything automatically or go through each issue yourself and decide exactly which columns to touch!

This project uses AI

Used AI to write the README and to tidy up the code structure with comments.

Demo Repository


aneezakiran07

Shipped this project!

Hours: 7.94
Cookies: 🍪 150
Multiplier: 18.89 cookies/hr

HII!!!
So, I finally shipped this after working on it for a while, and I'm happy with how it turned out (took me months :") ).

The app now has a full Validation & Quality section with five new operations: validate emails (flag or drop invalid ones), standardize messy phone numbers into one clean format, parse mixed-format dates, detect and cap outliers using IQR or Z-score with configurable thresholds, and validate value ranges to catch impossible values like age = -5 or score = 999.
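To make the outlier step concrete, here is a minimal sketch of IQR capping using only the standard library. The function name `cap_outliers_iqr` and the default multiplier `k=1.5` are assumptions for illustration, not the app's actual code, which probably works on pandas columns.

```python
import statistics

def cap_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    # Quartiles of the data; "inclusive" treats the data as the full population.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Values inside the fences pass through unchanged.
    return [min(max(v, low), high) for v in values]

ages = [22, 25, 27, 30, 31, 33, 35, 999]
capped = cap_outliers_iqr(ages)
print(capped)
```

A larger `k` makes the fences wider, so fewer values get capped. That is roughly what a configurable threshold would control.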

The hardest part was definitely date standardization: pandas was silently failing on most non-ISO formats, so I rewrote the parser from scratch to try 17 explicit formats per cell, which took it from almost useless to actually reliable.
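The per-cell approach could look something like the sketch below. The format list here is a hypothetical subset (the 17 formats the app tries are not listed in this post), and `parse_date` is an illustrative name.

```python
from datetime import datetime

# Hypothetical subset of formats; the real list of 17 is not given in the post.
DATE_FORMATS = [
    "%Y-%m-%d",   # 2024-03-05
    "%d/%m/%Y",   # 05/03/2024 (day first)
    "%m-%d-%Y",   # 03-05-2024 (month first)
    "%d %b %Y",   # 5 Mar 2024
    "%B %d, %Y",  # March 5, 2024
    "%Y/%m/%d",   # 2024/03/05
]

def parse_date(cell):
    """Return an ISO date string for the first format that matches, else None."""
    text = str(cell).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not fit, try the next one
    return None

print([parse_date(c) for c in ["2024-03-05", "5 Mar 2024", "not a date"]])
```

The order of the list matters: an ambiguous cell is read by whichever matching format comes first, so day-first and month-first variants that share a separator need a deliberate priority.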

I also made every operation follow the same clean interaction: popup dropdowns, an apply-to-all option, checkboxes, and action buttons that stay disabled until valid input is given.
I built this because I love data cleaning tasks, so I figured, why not make a general tool for it :")

Note: You can download the test_data CSV from my GitHub to try out the shipped project.

aneezakiran07

HI!!!
In this devlog, I added a new Validation & Quality section to the data cleaning pipeline, with five new validators that check emails, clean and standardize phone numbers, correctly parse mixed-format dates, detect and handle outliers, and validate value ranges. This update helped fix many hidden data issues, especially the date parser, which now handles almost all common formats. I also kept the UI consistent with the rest of the app.
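As a rough illustration of three of these validators, here is a small sketch. The email pattern is deliberately loose, the phone output format (`+` followed by country code and digits, with `1` assumed for 10-digit numbers) is an assumption because the post does not say which format the app targets, and all function names are hypothetical.

```python
import re

# Deliberately simple: something@something.tld with no spaces.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_email(value):
    """True when the value looks like an email address."""
    return isinstance(value, str) and bool(EMAIL.match(value.strip()))

def standardize_phone(value, default_country="1"):
    """Return '+<country><number>', or None if the digit count looks wrong."""
    digits = re.sub(r"\D", "", str(value))  # keep digits only
    if len(digits) == 10:                   # assume a missing country code
        digits = default_country + digits
    return "+" + digits if 11 <= len(digits) <= 15 else None

def in_range(value, low, high):
    """True when low <= value <= high, e.g. an age between 0 and 120."""
    return low <= value <= high

print(is_valid_email("ann@example.com"), standardize_phone("(555) 123-4567"), in_range(-5, 0, 120))
```

Rows that fail a check can then be flagged or dropped, depending on what the user picks.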

aneezakiran07

HII!!
So in this devlog, I turned the app into an assistant that inspects your data for you.
It now scans the dataset, finds problems like duplicates, currency symbols, wrong data types, percentages stored as text, and missing values, and shows them as simple fix cards.
Each issue comes with a one-click "Fix This" button, and I also added checkboxes next to the AI suggestions so users can select exactly which columns to apply fixes to.
I added an "Auto-Fix All Issues" button that runs the full pipeline in the best order and fixes everything at once.
Now beginners can clean data without technical knowledge, while advanced users still have full control.
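A scan like this can be sketched as a function that returns one record per problem found, which a UI could then render as a fix card. Everything below (the `scan_issues` name, the issue labels, the regex patterns, and the list-of-dicts input) is an assumption for illustration, not the app's real implementation.

```python
import re

CURRENCY = re.compile(r"^\s*[$€£]\s*[\d,]+(\.\d+)?\s*$")
PERCENT = re.compile(r"^\s*\d+(\.\d+)?\s*%\s*$")

# Each check takes one cell and returns True when the cell has that problem.
CHECKS = {
    "missing_values": lambda v: v is None or str(v).strip() == "",
    "currency_as_text": lambda v: isinstance(v, str) and bool(CURRENCY.match(v)),
    "percent_as_text": lambda v: isinstance(v, str) and bool(PERCENT.match(v)),
}

def scan_issues(rows):
    """rows is a list of dicts, one per record. Returns a list of issue dicts."""
    issues = []
    seen, dupes = set(), 0
    for row in rows:
        key = tuple(row.items())
        dupes += key in seen  # counts repeats of an earlier row
        seen.add(key)
    if dupes:
        issues.append({"issue": "duplicate_rows", "column": None, "count": dupes})
    for col in (rows[0] if rows else {}):
        cells = [row.get(col) for row in rows]
        for name, check in CHECKS.items():
            count = sum(1 for v in cells if check(v))
            if count:
                issues.append({"issue": name, "column": col, "count": count})
    return issues

sample = [
    {"price": "$1,200", "growth": "12%", "name": "A"},
    {"price": "$1,200", "growth": "12%", "name": "A"},
    {"price": "$30", "growth": "", "name": "B"},
]
for issue in scan_issues(sample):
    print(issue)
```

Keeping the column name on each issue is what makes per-column checkboxes possible: the fix step only needs the list of columns the user left ticked.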

aneezakiran07

HII!!
So in this devlog, I upgraded the website with a smarter missing value handler. It now detects many kinds of missing-data markers, like "NA", "?", -999, and more, then fills them in.
Numeric columns get KNN or MICE imputation depending on dataset size, while categorical columns get the mode or "Missing" automatically.
A threshold setting drops columns with too many missing values, and everything can be controlled from the sidebar.
I also added a one-click "Full Pipeline" button that runs all cleaning steps in the best order, with detailed feedback showing exactly what changed.
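Here is a small sketch of the non-numeric parts of that flow: turning sentinel markers into real missing values, filling a categorical column, and dropping sparse columns. The token list, the 50% default threshold, and the rule "use the mode when any value exists, otherwise the placeholder" are assumptions. KNN and MICE imputation are left out because they need a third-party library.

```python
from collections import Counter

# Assumed sentinel list; the post names only "NA", "?" and -999 "and more".
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "?", "-999"}

def normalize_missing(column):
    """Replace sentinel values with None so they count as missing."""
    return [None if v is None or str(v).strip().lower() in MISSING_TOKENS else v
            for v in column]

def fill_categorical(column, placeholder="Missing"):
    """Fill gaps with the most common value, or a placeholder if none exists."""
    present = [v for v in column if v is not None]
    fill = Counter(present).most_common(1)[0][0] if present else placeholder
    return [fill if v is None else v for v in column]

def drop_sparse_columns(table, max_missing=0.5):
    """Keep only columns whose missing fraction is at or below max_missing."""
    return {name: col for name, col in table.items()
            if sum(v is None for v in col) / max(len(col), 1) <= max_missing}

colors = normalize_missing(["red", "NA", "?", -999, "blue", "red"])
print(colors)
print(fill_categorical(colors))
```

Normalizing the markers first matters, because otherwise a column full of "?" strings looks complete and slips past the threshold check.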

aneezakiran07

HII!!
So in this devlog, I upgraded the app into a data transformation system by adding string cleaning and automatic type detection.
The system now cleans text, detects patterns like currency, percentages, units, durations, and numeric values, and converts them automatically.
A threshold system prevents wrong conversions, and users can control its sensitivity from a settings sidebar.
I also added a one-click "Run Basic Pipeline" button and detailed feedback showing exactly what was converted.
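One way a conversion threshold can work is to convert a column only when a large enough share of its cells parse cleanly. The sketch below handles just currency, percent, and plain numbers. The names `to_number` and `convert_if_confident`, the 0.8 default, and the choice to read "45%" as 45.0 rather than 0.45 are all assumptions.

```python
import re

# Optional currency symbol, a number with optional commas, optional percent sign.
NUMERIC_LIKE = re.compile(r"^\s*[$€£]?\s*(-?[\d,]*\.?\d+)\s*%?\s*$")

def to_number(text):
    """Parse '$1,200.50', '45%' or '3.2' into a float; return None otherwise."""
    match = NUMERIC_LIKE.match(str(text))
    return float(match.group(1).replace(",", "")) if match else None

def convert_if_confident(column, threshold=0.8):
    """Convert the column only when at least `threshold` of its cells parse."""
    parsed = [to_number(v) for v in column]
    hit_rate = sum(p is not None for p in parsed) / max(len(column), 1)
    # Below the threshold the column is returned untouched.
    return parsed if hit_rate >= threshold else list(column)

print(convert_if_confident(["$1,200.50", "45%", "3.2", "7"]))
print(convert_if_confident(["abc", "x", "5"]))
```

Lowering the threshold converts more columns but risks turning real text into gaps, which is the trade-off a sensitivity setting exposes.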

aneezakiran07

HI!!!
So in this devlog, I improved the data cleaning pipeline with a cleaner UI and a real-time statistics dashboard showing rows, columns, missing cells, duplicates, and data types.
I added a flexible preview slider (5–50 rows), a collapsible column info panel, export options (CSV and Excel), and a reset button to restore the original dataset instantly.

aneezakiran07

I've built a Streamlit UI that lets you upload a CSV and instantly clean it by dropping duplicate rows and columns and stripping extra spaces from text. You choose which functions to run using the provided buttons. I already pushed the code for these three core functions and added a data preview so you can see the results immediately. For the next session, I'm moving on to removing special characters and fixing missing values.

aneezakiran07

I spent an hour setting up a Streamlit interface to handle the tedious parts of data cleaning. I wrote three core functions that take any uploaded CSV and automatically fix it: one to strip hidden whitespace from text, one to drop duplicate rows, and a third to find and remove identical columns. The goal was to make something generic so I don't have to manually clean files every time I start a new project. It's simple, fast, and handles the "dirty" data work in one click.
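The three functions can be sketched in plain Python on a list of row dicts, as below. This is an illustration of the idea only: the function names are hypothetical, and the app itself most likely does the same thing on a pandas DataFrame.

```python
def strip_whitespace(rows):
    """Trim leading and trailing whitespace from every string cell."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
            for row in rows]

def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def drop_identical_columns(rows):
    """Remove any column whose values repeat an earlier column exactly."""
    if not rows:
        return rows
    seen, keep = set(), []
    for col in rows[0]:
        values = tuple(row[col] for row in rows)
        if values not in seen:
            seen.add(values)
            keep.append(col)
    return [{col: row[col] for col in keep} for row in rows]

rows = [
    {"name": " Ann ", "city": "Oslo", "town": "Oslo"},
    {"name": "Ann", "city": "Oslo", "town": "Oslo"},
    {"name": "Bob", "city": "Rome", "town": "Rome"},
]
cleaned = drop_identical_columns(drop_duplicate_rows(strip_whitespace(rows)))
print(cleaned)
```

Stripping whitespace first is what lets " Ann " and "Ann" be recognized as the same row in the next step.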

aneezakiran07

I’m working on my first project! This is so exciting. I can’t wait to share more updates as I build.
