
Data Cleaning Pipeline

50 devlogs
57h 13m 54s


Data Cleaning Pipeline, where users upload their CSV or Excel file and the app scans it, tells you what is wrong, and lets you fix it. You can let the AI handle everything automatically or go through each issue yourself and decide exactly which columns to touch!!!
Built it for data analysts, researchers, students (ME), and anyone who’s ever wasted hours cleaning a spreadsheet before they could even start their actual work.

This project uses AI

used AI for debugging Streamlit UI issues!
used AI for writing the prompts sent to Gemini for the AI Assistant, AI Overview and AI Guide

Demo Repository


aneezakiran07

sandboxed exec to block code injection in AI cleaner

I wasn't thinking much about security, but one kind and cool voter pointed out a major security issue in the system, so i got to work on it! Now i will work on more security features to make this pipeline secure to use!

Before

  • global execution environment allowed full access to the os and subprocess modules

  • no import blocking meant __import__('os').system(...) worked

  • dunder attributes like __class__ and __builtins__ were reachable from the user interface

  • manual edits to the generated code bypassed all internal security constraints

Now

  • static ast analysis rejects every import statement and forbidden name branch

  • regex pattern scanning catches shell style calls before the parser initializes

  • execution globals replace the __builtins__ default with a strict twenty five item whitelist

  • validation logic triggers on both the initial generation and the final apply action
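The static-analysis plus restricted-builtins combo above can be sketched roughly like this (a minimal illustration with made-up names, not the pipeline's actual code; the real whitelist and checks are stricter):

```python
import ast
import builtins

FORBIDDEN_NAMES = {"__import__", "eval", "exec", "open", "compile", "globals", "locals"}
# strict whitelist, analogous to the ~25-item list described above
ALLOWED_BUILTINS = {
    name: getattr(builtins, name)
    for name in ("abs", "len", "min", "max", "sum", "range", "round",
                 "str", "int", "float", "bool", "list", "dict", "set",
                 "tuple", "enumerate", "zip", "sorted", "print")
}

def validate_code(src: str) -> None:
    """Reject imports, forbidden names, and dunder attribute access statically."""
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports are not allowed")
        if isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            raise ValueError(f"forbidden name: {node.id}")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError(f"dunder attribute access: {node.attr}")

def run_sandboxed(src: str, env: dict) -> dict:
    """Validate first, then exec with a stripped-down __builtins__."""
    validate_code(src)
    exec_globals = {"__builtins__": ALLOWED_BUILTINS, **env}
    exec(compile(src, "<ai_code>", "exec"), exec_globals)
    return exec_globals
```

With this shape, `__import__('os').system(...)` and `x.__class__` tricks are rejected before `exec` ever runs, and anything that slips past static analysis still hits the builtins whitelist.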

aneezakiran07

Shipped this project!

Hours: 33.34
Cookies: 🍪 1181
Multiplier: 29.51 cookies/hr

HI!!
so I built a data cleaning pipeline in Streamlit.
it has features like an AI assistant that guides you on how to clean your data, tells you what your data is about, and generates code for you. Then there are
cleaning functions, custom validators for phone, email and date, a PDF cleaning report, correlation heatmap, data type guesser, live quality score, distribution charts, persistent session history, undo/redo, multi-filter stacking with AND/OR modes, CSV and Excel export, and much more!
It works on small and large datasets, even 100K+ rows!
ALSO!!! please do not judge performance too harshly. Streamlit reruns the entire script on every single interaction, which makes optimization genuinely EVIL! I did everything I could: lazy tab rendering, button-gated expensive calls, df-key based caching, chunked CSV reads, memory downcasting, and sampled analysis on large files.
I am switching frameworks in the next ship!! I'm done with streamlit :”)
test every feature and drop your feedback. I really want to know what functions you would find useful next. Thanks!!!

NOTE

if u get any error during the pipeline (though i really tried to handle all edge cases), refresh it and choose resume your session; it will resume your session up to the last step you performed before getting the error
.
.
(IDK if i should say this but i really spent so much time building this, please vote fairly TT
and ik some of you might not use it rn, but believe me, when you get school work someday or have to train your own model, you will see how tedious this work is :”) )
.
.
VIDEO DEMO LINK: (if u can't make sense of this project, watch this)
https://drive.google.com/file/d/1oI9FJehpTdsFCoxygUPd3RHjXB5-FK1O/view?usp=sharing

aneezakiran07

AI Assistant tab

Before

  • AI Cleaner was inside the Clean tab and was easy for users to miss :)
  • AI Data Analysis and General Guide were both in Guide tab

Now

  • new ai_assistant.py replaces guide.py and a single file now owns all three AI and guide sections
  • tab renamed to “AI Assistant”
  • AI Cleaner moved from Clean tab expander to top of AI Assistant tab
  • AI Data Analysis is below it
  • General Guide moves to the bottom
  • clean.py has the AI Cleaner expander and its import fully removed
  • app.py updated with new tab variable, new import
    also tested other features thoroughly so i can ship next
aneezakiran07

filter and inspect v3 match mode(AND OR) + export

Before

  • multiple filters always applied as AND (AND means both conditions must be satisfied for a row to show up in the filtered data)
  • no way to export filtered results (database peeps will know it's sometimes useful to export stuff after applying SQL queries like SELECT ... WHERE value ==)
  • no tooltips explaining what filters do or how match mode works

Now

  • ALL (AND) vs ANY (OR) radio toggle added at top with tooltip explaining both modes clearly with examples
  • ALL mode narrows down by chaining filters on the result of the previous filter
  • ANY mode builds a union mask across all filters applied independently to the original df
  • export section added below the dataframe with CSV and Excel download buttons
  • export always uses the full filtered result not the 500-row display cap i made for performance
  • tooltips added to match mode radio, add filter, clear all, all three metrics, and both export buttons
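The ALL/ANY logic above boils down to chaining vs unioning boolean masks; a tiny sketch (function name illustrative, not the app's actual code):

```python
import pandas as pd

def combine_filters(df, masks, mode):
    """ALL intersects every mask (each filter narrows the result);
    ANY unions masks computed independently against the original df."""
    if not masks:
        return df
    combined = masks[0]
    for m in masks[1:]:
        combined = (combined & m) if mode == "ALL" else (combined | m)
    return df[combined]

df = pd.DataFrame({"age": [15, 30, 45], "city": ["NY", "LA", "NY"]})
masks = [df["age"] > 20, df["city"] == "NY"]
# ALL keeps only rows matching every filter; ANY keeps rows matching any filter
```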
aneezakiran07

filter preview v2: multi-filter stacking + UI polish

Before

  • only one filter could be active at a time
  • no way to narrow down by multiple cols simultaneously
  • bw operator was missing so range checks needed two separate filters

Now

  • filter rows stored as id list in st.session_state.fp_filters so any number can stack
  • each row gets a unique id-prefixed key so streamlit tracks them independently
  • x button removes individual rows cleanly without affecting others
  • add filter and clear all buttons added below the filter rows
  • bw operator added for numeric and datetime with two side-by-side inputs (Min/Max, From/To)
  • display capped at 500 rows with caption showing match count for performance
aneezakiran07

filter preview v1: single column filter

Before

  • no way to inspect rows matching a condition without running an operation
  • users had to apply a fix and undo just to verify the right rows were targeted
  • full dataframe display with no filtering made spotting dirty values tedious

Now

  • filter_preview.py added as a read-only module, never writes to current_df
  • single filter row with column selector and condition selector
  • operators auto-detected per dtype: text gets contains/equals/starts with/ends with/is empty, numeric gets equals/gt/lt/is empty, datetime gets equals/before/after/is empty
  • match rate metrics row shows matching rows, total rows, and %age
  • filtered dataframe renders below with use_container_width=True
  • made it as a separate tab after overview so user can go to it to check the data
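The per-dtype operator detection could look something like this (a sketch; operator labels mirror the list above):

```python
import pandas as pd
from pandas.api import types as ptypes

def operators_for(series):
    """Pick filter operators based on the column's dtype."""
    if ptypes.is_numeric_dtype(series):
        return ["equals", "greater than", "less than", "is empty"]
    if ptypes.is_datetime64_any_dtype(series):
        return ["equals", "before", "after", "is empty"]
    # everything else is treated as text
    return ["contains", "equals", "starts with", "ends with", "is empty"]
```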
aneezakiran07

col_popover selection reset bug

Before

  • The columns the user selected got reset when switching to another tab, so when they came back to the tab they had to deselect and reselect columns for operations :)
  • on_change callbacks were the only sync mechanism between checkbox keys and val_selected
  • after a run, val_selected.pop() cleared the list but _vc_section_col keys stayed True in session state
  • streamlit never fired on_change again because widget value had not changed from its perspective

Now

  • col_popover reads _vc_ keys directly at render time and rebuilds val_selected from scratch every render
  • _make_all_handler now resets all individual _vc_ keys when toggling all on or off so checkboxes visually match
  • _make_col_handler syncs the all checkbox when individual columns are toggled

The issues i got while doing this:

  • col_popover was unconditionally rebuilding current from vc keys on every render. In streamlit, all tab contents execute on every rerun even when the tab isn't visible (mess, i know).
  • so i had to fix this too, and i used this flow:
  • If val_selected[section] already exists -> it was set by an on_change handler which is the definitive user action. Trust it, sync vc keys to match it for display, done.
  • If val_selected[section] is missing -> means either first ever render, or clear_popover just wiped it. Only then fall back to reading vc keys.
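Modelled as plain Python (a dict standing in for st.session_state, names illustrative), that flow is roughly:

```python
def resolve_selection(val_selected, vc_keys, section):
    """If a handler already set val_selected, trust it and sync the
    checkbox keys for display; otherwise rebuild from the keys."""
    if section in val_selected:
        chosen = val_selected[section]
        for col in vc_keys:          # sync display state to the selection
            vc_keys[col] = col in chosen
    else:
        chosen = [c for c, on in vc_keys.items() if on]
        val_selected[section] = chosen
    return chosen
```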

I said this earlier and will say it again: i hate how streamlit works, worst framework ever. i might change to some other framework after this ship.

aneezakiran07

button and popover alignment fix

My buttons were kind of inconsistent in the Recommendations and Validate tabs. My focus before was never UI, so i didn't bother changing them. But now i think i should make them consistent, so here i am

Before

  • Fix button and select columns popover rendered inside the same st.columns block as the description text
  • st.write("") spacers before buttons were unreliable, and column heights varied per section, making the UI look inconsistent
  • recommendations and validate both had this problem.

Now

  • separate st.columns call handles only the popover and button on their own dedicated row
  • both widgets are always the same height so they always land on the same line
  • button click logic moved outside col blocks so st.rerun() fires cleanly after all columns close
aneezakiran07

Executable Validation JSON feature

Before

  • import_pipeline_json had no branches for any custom validation step
  • JSON preview only existed on the save side as a raw code dump
  • Uploaded JSON had zero preview, and there was also no toast on successful JSON application TT

Now

  • _parse_label_meta() moved to module level so both import_pipeline_json and build_pipeline_script share it
  • Each step checks col in tmp.columns before running and skips if column is missing
  • threshold, min, max are safely cast to float with fallback so bad values never crash
  • _friendly_label() added in history_export.py
  • _render_json_preview() added, shows numbered step list with easy to read labels instead of raw JSON so non tech also get it
  • uploaded pipeline auto-expands its preview so user sees steps before hitting Apply
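The safe cast is the classic try/except fallback; something like:

```python
def safe_float(value, fallback):
    """Cast threshold/min/max values from imported JSON to float,
    falling back instead of crashing on bad input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return fallback
```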
aneezakiran07

Executable Validation Python code Export

(this was to be done for the custom validators i made in prev devlog)

Before

  • Exporting custom validation steps was not implemented yet
  • Custom regex, country code, date format and thresholds were all not working
  • Type Override had a parts.strip() bug (calling .strip() on a list) :))

Now

  • _parse_label_meta() reads key=value pairs embedded in label strings
  • all 5 validation steps export real runnable code with exact user settings
  • each commit_history call embeds col + settings per column
  • bug fixed, stubs added for Split/Merge/Rename :”)
    next i will have to make the custom validators work for pipeline.json also
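Reading key=value pairs out of a label string can be done with one regex; a sketch of the idea (the real _parse_label_meta may differ):

```python
import re

def parse_label_meta(label):
    """Pull key=value pairs embedded in a history label, e.g.
    'Validate phone (col=phone, country_code=+1)' -> {'col': 'phone', ...}."""
    return dict(re.findall(r"(\w+)=([^,\s)]+)", label))
```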
aneezakiran07

Global Undo and Redo in the Sidebar

Before

  • Undo and Redo were only inside History and Export, so the user had to switch tabs and couldn't access them easily
  • render_sidebar() had no access to history or redo_stack
  • Sidebar bottom stayed unused in Simple mode

Now

  • render_undo_redo() added in state.py and called in both sidebar modes
  • Reads from st.session_state and returns early; the buttons show even before any undo or redo has been performed, so the user knows from the start that the sidebar has this feature :)
  • Click triggers toast + st.rerun() for full UI sync
  • st.caption() shows the next undo/redo actions, so the user knows what is happening and what will happen if they press UNDO or REDO!
aneezakiran07

Custom Inputs for Date, Email, and Phone Validators

Now users can provide specific formats, patterns or country codes to override the general patterns

Before

  • validate_date_col() in validate.py relied on a 14-format loop and C-level inference, which could be slow or misinterpret ambiguous dates (eg 01/02/03).

  • validate_email_col() used a hardcoded regex pattern that couldn't be overridden for specific corporate domains or stricter requirements.

  • validate_phone_col() was hardcoded to US standards, making it difficult to validate international numbers

Now

  • validate_date_col accepts a custom_input_format parameter. If no custom format is set, it falls back to the default US format

  • validate_email_col features a custom_pattern parameter. A ternary operator replaces the default regex if a custom one is provided. A new try/except block catches invalid regex syntax

  • validate_phone_col now accepts default_country_code. The logic strips leading + signs and uses dynamic boolean masks for country code and length.
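The custom-pattern path with the regex-error guard might look like this (a sketch; the project's default pattern and return shape may differ):

```python
import re
import pandas as pd

DEFAULT_EMAIL = r"^[\w.+-]+@[\w-]+\.[\w.-]+$"  # illustrative default

def validate_email_col(s, custom_pattern=""):
    """Return a mask of invalid emails. An invalid custom regex
    falls back to the default instead of crashing."""
    pattern = custom_pattern or DEFAULT_EMAIL
    try:
        re.compile(pattern)
    except re.error:
        pattern = DEFAULT_EMAIL
    return ~s.astype(str).str.match(pattern, na=False)
```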

UI

  • the tooltips for these functions explain what to write in the custom pattern and how, so users understand how to use this feature
aneezakiran07

Add AI Executive Summary to PDF Report

Now users will get the AI summary in the PDF report too
Technical details:

Before

  • build_report_pdf() in reporting.py accepted only original_df, cleaned_df, history, and filename, with no AI summary about the dataset
  • The report opened with all the technical details, and a non-technical person couldn't make sense of what the data was even about
  • also the AI summary string already existed in st.session_state under ai_insights_{file_id} after the user clicked Generate AI Summary or Analyse my data in the Guide tab
  • history_export.py called build_report_pdf() with four positional args and moved on

Now

  • build_report_pdf() accepts an optional ai_summary: str = "" parameter
  • when ai_summary is non-empty, a light blue tinted LIGHT_BLUE background panel is rendered immediately below the title block; when it is empty, the report prompts the user to run the AI summary first to get it in the PDF
  • history_export.py reads st.session_state.get(f"ai_insights_{file_id}", {}) before calling the builder, pulls .get(“summary”, “”), and passes it through
aneezakiran07

Fix: Vectorized Phone and Date Validators

Before

  • validate_phone_col and validate_date_col both used .apply() to run a Python function row by row, causing HUGE wait times
  • validate_date_col ran every row through a 14-format try/except loop even when pd.to_datetime would have parsed it fine in one go!

Now

  • validate_phone_col is fully vectorized: .str.replace(r"\D", "", regex=True) strips non-digits across the whole column in one C-level pass, then .str.len() and boolean masks build the +1, + prefixes with pd.Series.where
  • validate_date_col uses a two-pass strategy: the first pass runs pd.to_datetime(..., infer_datetime_format=True), which handles the vast majority of dates at C speed in one go; the second pass only touches the rows that failed and tries the explicit format list. So on clean data the slow path never runs at all
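A stripped-down sketch of both vectorized approaches (rules simplified and names illustrative; the real validators handle more cases):

```python
import pandas as pd

def validate_phone_col(s):
    """Vectorized US-style check: strip non-digits in one C-level pass,
    then accept 10 digits, or 11 digits starting with 1."""
    digits = s.astype(str).str.replace(r"\D", "", regex=True)
    ok10 = digits.str.len() == 10
    ok11 = (digits.str.len() == 11) & digits.str.startswith("1")
    return ok10 | ok11

def validate_date_col(s, formats=("%m/%d/%Y", "%Y-%m-%d")):
    """Two-pass: fast vectorized parse first, explicit formats only
    for the leftover rows that failed."""
    parsed = pd.to_datetime(s, errors="coerce")
    failed = parsed.isna() & s.notna()
    for fmt in formats:
        if not failed.any():
            break  # clean data: the slow path never runs
        retry = pd.to_datetime(s[failed], format=fmt, errors="coerce")
        parsed.loc[failed] = retry
        failed = parsed.isna() & s.notna()
    return parsed
```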

Also changed the toast message; now it shows this ✓ along with the message!

aneezakiran07

Fix: Recommendations Tab State Wipe, Duration Regex False Positives

Before

  • fixing any issue in the recommendations tab wiped the entire tab state and dropped the user back to the “Scan for Issues” screen, because scan_key was f"rec_scanned_{df_key}" and df_key is derived from make_df_key(current_df), which changes every time a fix mutates the dataframe, so the old key simply stopped existing in session state
  • the duration regex (h|hr|hour|min|minute|sec|second) had no word boundaries, so the bare h matched the letter h inside any word: “John”, “the”, any English text at all would fire it and flag name or description columns as duration columns (found this one while testing on a movies dataset)

Now

  • scan_key is now a stable hardcoded string "rec_scanned" with no dependency on df_key, so it survives reruns after fixes; state.py explicitly resets it to False on every new file load and fresh session so it never carries over across files
  • duration regex tightened to \b(hrs?|hours?|mins?|minutes?|secs?|seconds?)\b with word boundaries in both get_analysis_and_recommendations and get_type_suggestions in cache.py; bare h removed entirely since no real duration data is ever just the letter h standing alone
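The tightened pattern with word boundaries (case-insensitivity assumed here for illustration):

```python
import re

# \b stops substring matches inside ordinary words; bare "h" is gone entirely
DURATION_RE = re.compile(r"\b(hrs?|hours?|mins?|minutes?|secs?|seconds?)\b",
                         re.IGNORECASE)

assert DURATION_RE.search("2 hrs 30 mins")          # real duration data fires
assert not DURATION_RE.search("John watched the movie")  # prose no longer does
```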

also i didn't know we could use ‘’ to show the code things!!! now i am gonna use them more!

aneezakiran07

Fix: Session Resume Dialog Crash and Loading Spinner

Before

  • clicking outside the resume dialog closed it without setting _resume_choice, leaving init_state in a broken half-initialized state and crashing on the next render

  • the loading spinner and info message never actually displayed because init_state ran synchronously on the same render pass, so the spinner appeared and vanished :) (handling edge cases is really tough)

Now

  • dismissible=False added to the dialog, disabling backdrop click and removing the X button so the user is forced to pick an option (read this thing in human computer interaction course)

  • loading is now two-pass: first rerun sets _resume_loading and renders the spinner, second rerun clears the flag and runs init_state normally, making the spinner genuinely visible between reruns

Fix: Recommendations Feedback and Overview Slider

 

Before

  • clicking Fix or Auto-Fix All in the recommendations tab gave no feedback :)

  • the overview tab crashed on single-row files with “Slider min_value must be less than max_value” when both values resolved to 1

 

Now

  • every Fix button in recommendations now wraps its operation in a spinner and fires a toast on success, matching the exact same pattern used in the validate tab from the last devlogs

  • Auto-Fix All also shows a spinner and toasts on completion

  • the overview slider now guards against min equal to max by skipping the slider entirely on single-row files and always flooring the minimum at 1
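The guard reduces to a bounds check before rendering the slider; sketched as plain logic (function name illustrative):

```python
def slider_bounds(n_rows):
    """Return (min, max) for the row slider, or None to skip the slider
    entirely (single-row files, where min would equal max)."""
    lo, hi = 1, max(1, n_rows)  # always floor the minimum at 1
    if hi <= lo:
        return None
    return lo, hi
```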

Note:

 Also this took me 45 minutes to do, but for some reason my devlog didn't get uploaded and i only saw that just now :) so

aneezakiran07

Performance improvement: AI Insights and Page Load Fix

so the first thing that happened when a user opened my app was loading the AI analysis, AI overview and AI guide, meaning the user had to wait up to 10 seconds even for a small file for all the tabs to load :)
now all tabs load instantly because there are buttons the user has to press to get the AI overview and guide!

Before:

  • opening the app with a 50k row file meant waiting 10 to 20 seconds before any tab appeared
  • guide.py called get_ai_insights automatically on every render which blocked the entire page
  • the overview tab had a separate streaming approach using st.write_stream which returned None in most streamlit versions so the summary never showed at all
  • also there was no per-file cache, so Gemini had to be called again and again on every reload/reupload of the file

Now:

  • both tabs share the exact same cache key ai_insights_file_id so one api call populates both
  • neither tab calls the api automatically on page load, the user triggers it with a button when they want it
  • guide tab shows an Analyse my data button and overview tab shows a Generate AI Summary button
  • whichever button the user clicks first, both tabs show their content on the next rerun with zero extra api calls
  • st.write_stream and the entire streaming approach were removed since streamlit tabs buffer all output and flush at once, streaming never worked there
  • ai_insights.py went from 310 lines to 190 by removing the dead streaming code
  • page now loads instantly regardless of file size, gemini is never called until the user asks for it
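The shared-key idea in miniature (a plain dict stands in for st.session_state; names illustrative):

```python
session_state = {}  # stand-in for st.session_state

def get_ai_insights(file_id, generate):
    """Both buttons funnel through the same cache key, so whichever is
    clicked first populates both tabs with a single API call."""
    key = f"ai_insights_{file_id}"
    if key not in session_state:
        session_state[key] = generate()  # the one and only API call
    return session_state[key]

calls = []
def fake_gemini():
    calls.append(1)
    return {"summary": "clean-ish data"}

# "Analyse my data" then "Generate AI Summary": one call total
a = get_ai_insights("f1", fake_gemini)
b = get_ai_insights("f1", fake_gemini)
assert a is b and len(calls) == 1
```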
aneezakiran07

UI Feedback - Spinners, Toasts, and the Dialog Bug

so in this devlog, i focused on making the app feel responsive :)

Before:

  • when a user clicked any button, the app would run the operation, call st.rerun(), and then show a success banner on the next render
  • this was handled using _omsg in session state across clean.py and validate.py
  • the success message had no auto-dismiss, so it just stayed there until the next interaction
  • sometimes the message didn’t even show, depending on which tab rendered first
  • there was no loading feedback, so the app felt frozen during heavy operations
  • validate.py had zero spinners at all, meaning everything just paused WITHOUT even notifying the user
  • spinner messages (where present) were vague like “Processing…” with no real context
  • redo / history-related feedback existed but didn’t clearly communicate what was happening

Now:

  • replaced the entire _omsg pattern with a cleaner toast-based system
  • new flow: set st.session_state["_toast"] = (msg, icon) -> call st.rerun() -> pop it at render -> show st.toast()
  • toasts automatically disappear after ~3 seconds, so no more stuck messages
  • they appear in the corner and don’t mess with layout at all (way cleaner UX)
  • users now immediately see feedback like “please wait…” instead of thinking the app died
  • instead of “Stripping whitespace…”, it now says things like:
    “Scanning 50,000 rows and stripping whitespace…”
  • fixed the dialog bug where “Resume previous session?” wouldn’t close
  • issue was st.rerun() inside @st.dialog only rerunning the fragment, fixed with: st.rerun(scope="app")
  • also cleaned up history_export.py removed old show_msg() + last_success_msg usage
  • migrated everything to the toast system. Some of the functions still have poor feedback :)
    this stuff is really so annoying, esp streamlit, ughh!!! idk whether to improve user feedback or performance or add new functions.
aneezakiran07

HII!!

History tab UI improvements

So in this devlog, I improved the History tab by adding visual feedback for all operations and making the redo stack visible!

Before:

  • when a user clicked Undo, Redo, Reset Data, Clear History, Apply Pipeline, or Generate Report, the app froze for a moment and refreshed, giving little hint that the operation succeeded.

  • there was no visual feedback or spinner showing that the app was actually processing the request.

  • also the user could see their history of past actions but they had no idea what was inside the redo stack waiting to be redone.

  • some buttons didn’t even show a green success box after finishing!

Now:

  • I wrapped all the major operations inside history_export.py (like undo and redo) with with st.spinner("… please wait…"): blocks.

  • now when a user clicks any button in the history tab they get an immediate spinning message telling them exactly what is happening.

  • I also imported show_msg from state.py so every single operation assigns a message to st.session_state.last_success_msg before it calls st.rerun().

  • now the user always gets a green success confirmation box once the spinner finishes!

  • added a new list right below the main history list that loops through the redo_stack.

  • now if the user undoes an action it pops up in the Available to Redo list so they know exactly what will happen if they click the Redo button.

next, i don't know what needs improvement, but i might focus more on user feedback :)

aneezakiran07

HII!!

Redo operation

So in this devlog, I integrated the redo operation into the history system!
Before:

  • the undo history lived in st.session_state and got saved to the .dp_sessions folder, but if a user accidentally clicked undo, they lost that cleaned state forever
  • The app could only go backwards, not forwards :”)
  • also when a user clicked undo on a large dataset the app just paused for a second and refreshed. there was no visual feedback that the pipeline was working on reverting the data
Now:

  • I added a redo stack to the pipeline. when a user clicks undo, the current dataframe gets pushed onto a redo stack. if they click redo, it pops from that stack and goes back into the main history.
  • whenever the user makes a brand new cleaning change, the redo stack clears itself because the timeline has changed. I updated session_persist.py so the redo stack is saved to the disk and survives page reloads
  • UI changes: wrapped the undo and redo functions inside st.spinner blocks inside the history export tab
  • now when the user clicks either button they get an immediate spinning message saying Undoing last action please wait or Redoing action please wait
  • added a new list right below the main history list that loops through the redo_stack. now if the user undoes an action it pops up in the Available to Redo list so they know exactly what will happen if they click the Redo button
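The undo/redo mechanics described above, modelled as a tiny class (illustrative, not the pipeline's actual code):

```python
class History:
    """Undo pushes the current state onto redo; redo pops it back;
    a brand new change clears redo because the timeline diverged."""

    def __init__(self, state):
        self.state = state
        self.undo_stack = []
        self.redo_stack = []

    def commit(self, new_state):
        self.undo_stack.append(self.state)
        self.state = new_state
        self.redo_stack.clear()  # new change invalidates the redo timeline

    def undo(self):
        if self.undo_stack:
            self.redo_stack.append(self.state)
            self.state = self.undo_stack.pop()

    def redo(self):
        if self.redo_stack:
            self.undo_stack.append(self.state)
            self.state = self.redo_stack.pop()
```

The redo_stack is exactly what the "Available to Redo" list renders from.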

next i will work on making the user feedback for the history tab better! ik right now it sucks, but don't worry, i will improve it!

aneezakiran07

Valid tab: UI Improvements

  • Now each function gives clear feedback
  • eg the email validator tab will say X emails have been flagged or removed
  • same with the phone validator or standardizing the dates
  • i didn't have to fix the UI for the outlier cap and validate range functions as i had already done that before
aneezakiran07

Clean Tab: Performance and UI Improvements

voters said before that my app sucks at feedback, so now i'm trying to make it better
first of all, i made these changes in the clean tab

  • Every operation now calculates and reports the actual difference.

  • Strip whitespace counts changed cells per column.

  • Duplicate drop shows the percentage of data removed.

  • Smart cleaner lists the names of converted columns.

  • Missing value handler reports filled counts and dropped columns separately.

  • Type override warns when conversion failures create new nulls.

  • Find and replace reports the exact match count.

  • Outlier capping reports the exact number of values affected.

  • Type guesser adds a per-column progress bar.

  • Type guesser now runs via button instead of opening automatically for better performance
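The "report the actual difference" pattern, shown for whitespace stripping (a sketch with illustrative names, not the app's exact code):

```python
import pandas as pd

def strip_whitespace_with_report(df):
    """Strip whitespace on text columns and report changed cells per
    column, so the UI can show a concrete before/after count."""
    out = df.copy()
    changed = {}
    for col in out.select_dtypes(include="object"):
        stripped = out[col].str.strip()
        # NaN-safe comparison: NaN != NaN would otherwise overcount
        n = int(((stripped != out[col]) & out[col].notna()).sum())
        if n:
            changed[col] = n
        out[col] = stripped
    return out, changed
```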

Next i will update the validate tab

aneezakiran07

Upload Feedback Fix- Small fix

Before:

  • user picked a file and saw nothing. no spinner, no message, nothing.
  • the upload tab showed the success message only after everything was done
  • so on Excel files with large sheets there was a silent 5 to 10 second gap where the app looked frozen and the user had no idea if the upload worked. some users switched tabs thinking nothing happened.

Now:

  • upload.py checks loaded_file_id against the current uploaded file_id

  • if they do not match the file was just picked but not yet parsed so it immediately shows Reading filename please wait in blue

  • this renders on the very first rerun after file selection before app.py has done any work at all

  • once init_state finishes and sets loaded_file_id the next rerun flips the message to the green success confirmation

  • user always sees feedback within one rerun of picking the file

One user complained that they often had to upload the file twice; now they will get clear feedback and their file will upload quickly

aneezakiran07

Lazy Tab Rendering

Before:

  • all tabs blocked on page load for 30 to 40 seconds on large files.
  • get_analysis_and_recommendations ran unconditionally at the top of recommendations.py
  • the correlation heatmap ran inside a closed expander on every render pass
  • streamlit executes everything inside expanders even when they are collapsed, meaning expensive functions were being run even when the user didn't need them :(

The cache.py :”) :

  • spent 30 minutes adding sampling and large file guards to every analysis function in cache.py thinking that would fix the slowness. it helped memory and accuracy in large files but did not fix the tab loading delay at all
  • problem was never the speed of the functions. it was that they were being called in the first place on every render even when the user had not asked for them.
  • the real fix had nothing to do with cache.py; i wasted my time working on cache.py tbh!

Now:

  • recommendations are behind a Scan for Issues button, and tab opens instantly.
  • scan results are cached in session state keyed on df_key so they survive reruns
  • but auto invalidate when the dataframe changes after a cleaning step.
  • correlation heatmap is behind a Compute Correlation button inside its expander.
  • same df_key keying pattern so results persist until data actually changes.
  • clean, validate, and history tabs were already instant and needed no changes.
  • on a 1 million row file all tabs now open after some wait (better than before tho).

Also, only the history tab is still stuck with just 2 functions; there is a bug keeping it from showing the other functions instantly. i will work on that in the next one, but all the other tabs are fine. i know speed is an issue, but for 10k rows this is all i can do for now :(
Edit: AI also gave its summary after 2 min TT big wait time, but it's a given for 10k rows :)) maybe one day i might become a pro and improve its performance, but to me performance is really hard to solve rn

aneezakiran07

Large File Support

Before:

  • everything assumed small datasets. pd.read_csv loaded the entire file into memory at once.
  • KNN and MICE imputers were called on 500k row files, which would freeze or crash the browser; also, all analysis functions scanned every row every time.

Now:

  • CSV files are read in 50k row chunks and concatenated so peak memory stays low
  • int64 columns are downcast to int32 and float64 to float32 on files over 50k rows
  • this cuts memory roughly in half with a warning shown to the user about precision loss
  • KNN and MICE imputers now have a row limit; above 50k rows they fall back to mean imputation, which runs in linear time and is safe for 500k rows (found this info in a Medium article)
  • all expensive analysis functions like recommendations, correlation, quality score and type suggestions now run on a 20k row random sample on large files
  • histogram KDE is computed on a 10k sample while counts still use the full series
  • the overview tab now shows a warning banner when downcasting was applied
  • row count metrics now use comma formatting so 100000 renders as 100,000 (yay, kinda low-level detail TT)
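The chunked read plus downcasting can be sketched like this (thresholds from the list above; function name illustrative):

```python
import io
import numpy as np
import pandas as pd

def read_csv_chunked(buf, chunksize=50_000, downcast_over=50_000):
    """Read the CSV in chunks to keep peak memory low; downcast numeric
    dtypes on large files, roughly halving memory."""
    df = pd.concat(pd.read_csv(buf, chunksize=chunksize), ignore_index=True)
    if len(df) > downcast_over:
        for col in df.select_dtypes(include="int64"):
            df[col] = df[col].astype(np.int32)
        for col in df.select_dtypes(include="float64"):
            df[col] = df[col].astype(np.float32)
    return df

# tiny demo with a low threshold so the downcast path actually runs
csv = io.StringIO("x,y\n" + "\n".join(f"{i},{i/2}" for i in range(5)))
df = read_csv_chunked(csv, chunksize=2, downcast_over=3)
```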

Note:

  • In the uploaded video, the data gets loaded really quickly but the tabs show their stuff really late

  • it took 30 to 40 seconds, and that was because of the correlation heatmap and the expensive analysis functions. In the next update, i will try to make only the tabs with expensive functions slow down; other tabs like Validate and Clean should show up right away as they really don't have any stuff that needs to pre-run

  • Also i attached a screenshot, check it out, my app also loaded a

1 MILLION rows TT

yay, i had to write that in bold! also tho it's a bit slow, I can always work on this part <3

aneezakiran07

Persistent Session History

Before:

  • undo history lived only in st.session_state; one page refresh and everything was REMOVED
  • users lost all cleaning progress on reload, which was evil, ik

Now:

  • added session_persist.py as a standalone persistence module
  • history and both dataframes are pickled to home dir under .dp_sessions folder
  • sessions are keyed by md5 of filename plus filesize, NOT streamlit file_id
  • this was the core bug, file_id is a fresh UUID on every reload so the session was never found
  • sessions older than 2 days are auto cleaned on each init
  • commit_history and undo_last both autosave to disk after every mutation
  • on file load init_state checks for a saved session before doing anything
  • if a saved session exists a native st.dialog pops up showing time since save and step count
  • user picks continue to restore full history and current df or start fresh to wipe and begin clean
  • reset to original and clear history both delete the persisted file
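The stable session key is just a hash of name plus size; e.g.:

```python
import hashlib

def session_key(filename, filesize):
    """Stable session id from filename + size; unlike Streamlit's
    file_id (a fresh UUID on every reload), this survives refreshes."""
    return hashlib.md5(f"{filename}:{filesize}".encode()).hexdigest()

# same file -> same key across reloads; different size -> new session
assert session_key("sales.csv", 1024) == session_key("sales.csv", 1024)
assert session_key("sales.csv", 1024) != session_key("sales.csv", 2048)
```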

How tough it was:
YES I NEED TO TELL THIS!!!
3 issues no 4:

  • first file_id changing every reload so session was never found
  • second windows path separator making tmp dir fail silently
  • third st.stop halting before current_df was set so the page just crashed quietly
  • fourth pycache serving the old pipeline.py so autosave never fired at all
    Literally tried so many methods to get this to work TT i hate how session stuff works :(
0
aneezakiran07

Two updates and longer devlog cuz HACKATIME was down :((

AI Cleaner Improvements

Before:

Gemini blindly generated code without checking if the data was ready. dirty columns (currency symbols, % signs, units) caused runtime crashes with no USER FEEDBACK :”)

Now:

  • Gemini gets raw sample values per column
  • added a feasibility check, now gemini reasons before writing any code
  • if the app already has a feature for it, Gemini says go to the exact tab + feature, since that's a guaranteed method that works, but it also generates code in case the user wants to run it
  • if a column is dirty and the operation assumes clean data, a warning is shown to fix the column first with the right feature, but code is still generated if the user wants to run it anyway
  • gemini timeout bumped to 30s and it auto-retries once on timeout

Performance Improvements

Before:

  • st.cache_data was hashing the ENTIRE dataframe on every cached function call
  • with 7 cached functions each receiving the full df, that's 7 full df hashes per rerun.
  • also st.cache_data.clear() was nuking ALL cache on every new file upload.
  • profile tab was computing correlation, missing heatmap, and before/after on every tab switch.
  • result: 20 to 30s load times even on datasets under 1000 rows :”)

Now:

  • added make_df_key() in cache.py
    all 7 cached functions now take df_key as first arg instead of the full df
  • Now there is a 3-phase workflow
  • Phase 1 (~instant): all 8 tabs render immediately, the user sees the full app right away
  • Phase 2 (~2-10s): app.py calls Gemini after all tabs are rendered; before, Gemini was called before the app finished loading, which was causing the issue
  • Phase 3 (instant): st.rerun() fires, all tabs re-render with the cached insights already in session state
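the make_df_key() idea, sketched with assumed internals (the real cache.py may fingerprint the dataframe differently):

```python
import hashlib

import pandas as pd


def make_df_key(df: pd.DataFrame) -> str:
    # cheap fingerprint: shape + column names + dtypes + a small value sample,
    # instead of hashing every cell like st.cache_data does by default
    parts = [
        str(df.shape),
        ",".join(map(str, df.columns)),
        ",".join(map(str, df.dtypes)),
        df.head(100).to_json(),  # sampling keeps hashing cheap on big files
    ]
    return hashlib.md5("|".join(parts).encode()).hexdigest()
```

the cached functions then take this short string as their first argument and read the actual dataframe from session state, so @st.cache_data only ever hashes a tiny key instead of the full df.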
0
aneezakiran07

HI!
In this devlog i moved and upgraded the AI analysis feature

AI analysis moved to Guide tab

  • before this the AI fixes were showing inside the Recommendations tab
  • it felt cluttered there because recommendations already has its own rule-based list
  • moved the whole AI analysis into the Guide tab
  • Guide tab is where users go to understand what to do, so AI fits there ig

what the Guide tab AI section now includes

  • a plain English summary of what the dataset is about
  • a list of prioritised fix cards, most important first
  • each card shows the issue title, which column it affects, and why it matters
  • each card shows exactly which tab to go to and what to click to fix it
  • each card shows an AI Cleaner shortcut so the user can just type the fix instead
  • the shortcut tells them exactly what to type into the AI Cleaner in the Clean tab
  • the cards use real tab names and real button names from the actual UI
  • styled with a blue border for the how to fix section and green for the AI shortcut

what the Overview tab now includes

  • AI summary is also shown at the top of Overview as a duplicate
  • it uses the same cached result so it costs zero extra API calls
  • non-tech users see the summary immediately when they open Overview

how caching works

  • Gemini is called once per file load
  • result is stored in session state under ai insights file id
  • Guide tab and Overview tab both read from the same cache key
  • new file uploaded means a fresh call, same file means instant load

also

It works for multi-sheet excels too! but you have to select the sheet you want the AI guide for!!!

0
aneezakiran07

HI!
In this devlog i built

AI data summary and AI fix suggester

  • one Gemini API call does two jobs at the same time
  • first job is a plain English summary of what the dataset looks like
  • second job is a prioritised list of fixes the AI recommends
  • both come back in one JSON object so we only spend one request per file
  • the result is cached in session_state keyed to the file id
  • this means even if the user switches between Overview and Recommendations tabs the call only happens once
  • new file uploaded means new call, same file means cached result

how the summary works

  • shows up at the top of the Overview tab
  • 2 to 4 plain English sentences about what the data is and its quality
  • non-tech users immediately understand what they are looking at

how the fix suggester works

  • shows up at the top of the Recommendations tab above the existing rule-based ones
  • each fix has a priority number, a column name, a short issue title, and a one-sentence reason
  • sorted by priority so the most important thing is always first
  • styled as cards with a blue left border and priority number
Attachment
Attachment
0
aneezakiran07

HI!
In this devlog i built

AI natural language cleaner

  • user types a plain English instruction like drop rows where Age is empty
  • app sends the instruction plus column names, types, and 5 sample rows to Gemini
  • prompt asks Gemini to return a JSON object with two fields, code and explanation
  • so i used my gemini api key and built this ai natural language cleaner
  • using gemini-3.1-flash-lite-preview, fast and cheap and allows 500 requests per day
  • temperature set to 0.1 so output is consistent and not creative
  • maxOutputTokens set to 1024 to support multi-line code for complex queries
  • Gemini can write code of as many lines as needed
  • this means complex instructions like normalise then fill then rename all work in one go
  • code field is the pandas code to run
  • explanation field is a plain English sentence of what the code will do
  • app shows the explanation with “Verify that this matches what you want” so users know what will happen once they apply this code
  • app shows the code in an editable text area so tech users can change it if needed
  • user clicks Apply to run it, or Cancel to discard
  • code runs in an isolated exec() with only df and pd available, nothing else
  • if execution fails the error is shown and the dataframe is not changed
  • on success the result goes into current_df and gets logged to History
  • API key is loaded from a .env file using python-dotenv
  • if the key is missing a warning is shown and the section is hidden
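a rough sketch of what that isolated exec() could look like — my guess at the shape, with a basic AST check bolted on (see the sandboxing devlog above), not the exact app code:

```python
import ast

import pandas as pd


def run_generated_code(code: str, df: pd.DataFrame) -> pd.DataFrame:
    # static check: reject imports and dunder access before executing anything
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports are not allowed in generated code")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attributes are not allowed")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise ValueError("dunder names are not allowed")
    # run with only df and pd visible, and no builtins at all;
    # work on a copy so a failed run never touches the real dataframe
    scope = {"df": df.copy(), "pd": pd, "__builtins__": {}}
    exec(code, scope)
    result = scope["df"]
    if not isinstance(result, pd.DataFrame):
        raise ValueError("generated code must leave a DataFrame in `df`")
    return result
```

on success the returned frame would go into current_df; on any exception the original df is untouched.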
1

Comments

aneezakiran07
aneezakiran07 17 days ago

i changed the caption TT before i was running ollama for myself as it only runs locally, then i switched to gemini for u guys so!

aneezakiran07

Guide ME feature
I literally tried three approaches before landing on this one.

  • the first attempt used a fixed-position HTML card with Streamlit buttons rendered below it, relying on CSS position:fixed !important to reposition the button row to visually overlap the card. this failed because Streamlit’s component tree does not guarantee DOM order or stable class names across rerenders, so the CSS selectors targeting the button row were unreliable and the buttons kept appearing at the bottom of whichever tab was open.
  • the second attempt used links styled as buttons inside the HTML card, with a handler at the top of render that read st.query_params and updated state. this seemed promising but Streamlit treats query param URL changes as full page navigations, not rerenders. clicking Next reloaded the entire app from scratch and cleared the uploaded file.
  • then finally, added a Guide tab to the data cleaning app
    it's a static checklist, it only exports one function: render(tab).
    it takes the Streamlit tab object and renders inside it.
    in app.py, add a Guide tab to the st.tabs call and pass it to guide.render().
    this was the easiest and is done

What i wanted was a cool guide feature that shows in the bottom right and guides users, like many software products do
but it was nearly impossible in streamlit for me tho :”) so i ended up making just a guide me tab :)
pushed both files on github tho so no one can say how can u spend 1 hr 30 min on just one tab that shows only static data TT

Attachment
0
aneezakiran07

Shipped this project!

Hours: 4.97
Cookies: 🍪 109
Multiplier: 21.97 cookies/hr

HI!!

so i built a data cleaning pipeline in streamlit.

also in this ship, i made modular architecture with separate folders for cleaning logic, tab rendering, and caching because before my architecture was such a mess :”)

now the cleaning package is pure python with zero streamlit dependency so every function is independently testable.

toughest part was the pipeline json replay. steps that need manual column selection like email validation cannot be automated, so the replayer had to be honest about what it skipped!!!

also built a pdf report, correlation heatmap, data type guesser, live quality score, and distribution charts. The correlation heatmap only shows for advanced users, as simple folks really can't understand what it's about!

I hope you test each feature thoroughly and give me your valuable feedback! Suggest more functions that you think you might need

Thanks a lot!!!

aneezakiran07

Hi! in this devlog i built
data quality score

  • new score from 0 to 100 shown at the top of the overview tab, computed across five dimensions each worth 20 points
  • completeness penalises missing cells proportionally, so a dataset with 10% missing values loses about 2 points on that dimension
  • uniqueness penalises duplicate rows the same way
  • type consistency scans every object column and checks what fraction of values are actually numeric strings stored as text. the more columns like this, the lower the score
  • outlier cleanliness uses a 3x IQR fence instead of the standard 1.5x so only extreme values count against the score, not natural spread (i hate this probability stuff but it's so important for data science and ML TT)
  • validity checks text columns for common placeholder strings like none, na, n/a, null, unknown, and empty strings
  • it's colour coded: green above 80, amber (yes we all think of that amber :”)) between 55 and 80, and red below 55
  • five breakdown cards sit below the gauge, one per dimension, each showing the raw score out of 20, the grade, and a plain english explanation of what was found
  • the whole thing runs through @st.cache_data so it recomputes automatically whenever the dataframe changes and the user sees the score tick up in real time as they clean
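as a taste, the completeness and uniqueness dimensions could be computed like this (an illustrative sketch, not the app's exact scoring code):

```python
import pandas as pd


def completeness_score(df: pd.DataFrame) -> float:
    # each dimension is worth 20 points; completeness penalises missing
    # cells proportionally, so 10% missing loses 2 points here
    total_cells = df.shape[0] * df.shape[1]
    if total_cells == 0:
        return 20.0
    missing_ratio = df.isna().sum().sum() / total_cells
    return round(20.0 * (1 - missing_ratio), 2)


def uniqueness_score(df: pd.DataFrame) -> float:
    # duplicate rows are penalised the same proportional way
    if len(df) == 0:
        return 20.0
    dup_ratio = df.duplicated().sum() / len(df)
    return round(20.0 * (1 - dup_ratio), 2)
```

the overall 0–100 score is then just the sum of the five 20-point dimensions.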
0
aneezakiran07

Hi,in this devlog, i did

cleaning report PDF export

  • added new reporting.py module that builds the entire PDF using reportlab
  • NOW report includes a summary table comparing rows, columns, missing cells, and duplicate rows before and after cleaning with a change column
  • full column profile for both the original and cleaned dataset showing type, null count, null percentage, and unique value count
  • missing value breakdown table that lists every column that had nulls before or after, and flags whether it was fully resolved
  • ordered list of every cleaning step performed with the dataframe shape after each one so you can trace exactly what changed when
  • first 10 rows of the cleaned dataset as a sample table at the end
  • PDF regenerates automatically only when the history changes, using the same history_len cache key pattern already used for the Excel download
  • reportlab added to requirements.txt. so if the import fails the UI shows a pip install hint instead of crashing the whole app
Attachment
0
aneezakiran07

HI!!!
In this devlog i did!

Data Type Guesser

  • new section at bottom of clean tab that scans every column and suggests the correct type based on what the values actually contain
  • detects nine patterns: email addresses, boolean values, currency amounts, percentages, measurement units, time durations, datetimes, plain numeric strings, and low cardinality categories
  • each suggestion shows the column name, current type, suggested type, a confidence percentage, the reason it was flagged, and three sample values so you can judge before applying
  • rendered as an editable table where you tick which suggestions to apply and press one button, so you never have to go column by column manually
  • also the confidence is computed from the proportion of values that matched the pattern, so a column where 95% of values look like currency scores higher than one where 60% do
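the confidence math is simple, e.g. for the currency pattern (the regex here is illustrative, not the app's exact one):

```python
import pandas as pd

# toy currency pattern: optional whitespace, a symbol, digits with
# optional thousands separators and decimals
CURRENCY_PATTERN = r"^\s*[$€£]\s*\d[\d,]*(?:\.\d+)?\s*$"


def currency_confidence(series: pd.Series) -> float:
    # confidence = fraction of non-null values matching the pattern,
    # so a 95% currency-looking column scores higher than a 60% one
    values = series.dropna().astype(str)
    if values.empty:
        return 0.0
    matches = values.str.match(CURRENCY_PATTERN).sum()
    return round(matches / len(values), 2)
```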
0
aneezakiran07

Pipeline JSON Save and Reload

  • cleaning history now exports to a small pipeline.json that stores step labels only, not data

  • upload that file on any new dataset and the app replays every automatable step in order

  • Steps that need manual column selection like email validation or outlier capping are skipped and reported back honestly

  • Every replayed step is pushed onto the undo stack so the full workflow stays consistent
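the replay loop could look roughly like this — the AUTOMATABLE registry and step labels are hypothetical, just to show the skip-and-report-honestly idea:

```python
import json

import pandas as pd

# hypothetical registry: step label -> automatable pandas transform
AUTOMATABLE = {
    "drop_duplicates": lambda df: df.drop_duplicates().reset_index(drop=True),
    "strip_whitespace": lambda df: df.apply(
        lambda s: s.str.strip() if s.dtype == object else s
    ),
}


def replay_pipeline(pipeline_json: str, df: pd.DataFrame):
    steps = json.loads(pipeline_json)["steps"]  # labels only, no data
    applied, skipped = [], []
    for label in steps:
        if label in AUTOMATABLE:
            df = AUTOMATABLE[label](df)
            applied.append(label)
        else:
            # steps needing manual column selection are reported, not faked
            skipped.append(label)
    return df, applied, skipped
```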

Correlation Heatmap

  • Added to the Profile tab between the missing value heatmap and the before/after comparison

  • Symmetric grid where each cell is colour encoded, blue for positive correlation, red for negative, white near zero

  • Value labels are drawn inside cells when there are 12 or fewer columns, hidden automatically on wide datasets so the chart stays readable

  • Supports Pearson, Spearman, and Kendall methods with a dropdown to switch between them

  • A summary line below the chart surfaces the three strongest off-diagonal pairs so you can spot redundant columns at a glance without reading the whole grid

  • All computation runs through @st.cache_data so switching methods is instant with no recompute
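pulling the strongest off-diagonal pairs out of a correlation matrix can be done like this (a sketch, not the app's exact code):

```python
import pandas as pd


def strongest_pairs(df: pd.DataFrame, method: str = "pearson", top: int = 3):
    # surface the strongest off-diagonal correlations so redundant
    # columns can be spotted without reading the whole grid
    corr = df.corr(numeric_only=True, method=method)
    pairs = (
        corr.stack()
        .reset_index()
        .rename(columns={"level_0": "col_a", "level_1": "col_b", 0: "r"})
    )
    # keep each unordered pair once and drop the diagonal
    pairs = pairs[pairs["col_a"] < pairs["col_b"]]
    pairs = pairs.reindex(pairs["r"].abs().sort_values(ascending=False).index)
    return list(pairs.head(top).itertuples(index=False, name=None))
```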

Also this is only available in ADVANCED mode, don't want to scare off the users who don't know much!!

0
aneezakiran07

Data Cleaning Pipeline v3.0.0

  • I figured out i should do devlogs like this from now on instead of doing long para as these are more readable

Modular Refactor

  • Broke an 1800-line single file into 19 focused files across a clean folder structure. For me, understanding the code before was easy because streamlit needs so little code, but for reviewers it might not be, so!
  • cleaning/ folder is now Python with zero Streamlit dependency so every function is independently testable
  • Each tab gets its own file with a single render() function, app.py is now just 106 lines

Distribution Charts

  • Added histogram with KDE curve overlay to the Profile tab for numeric columns
  • IQR fence lines drawn directly on the histogram so you can visually judge outliers before deciding to cap them
  • categorical columns get a horizontal frequency bar chart with percentage tooltips
  • missing value heatmap shows a sampled grid of the whole dataset, red cells are nulls, blue are present
  • All chart data goes through @st.cache_data so switching columns has no recompute cost

Column Transforms

  • Split column breaks one column into many on any delimiter, extra parts pad with NaN, source column optionally kept
  • merge columns concatenates any number of columns into one with a custom separator
  • rename columns gives an editable table so you can fix all names in one session without touching code
0
aneezakiran07

Shipped this project!

Hours: 18.07
Cookies: 🍪 184
Multiplier: 18.5 cookies/hr

HII!!!

So I finally shipped another update.

The app now includes a column profiler that shows per-column stats like min, max, mean, median, skewness, null percentage, and sample values. so users can quickly understand their data before cleaning it. I also added find & replace with optional regex support and a full undo/history system where every cleaning step is tracked and can be reverted anytime.

Another useful feature is before/after column comparison. You select a column and it shows original vs cleaned values side by side while highlighting exactly which rows changed. I also implemented column type override, allowing users to force columns into types like int, float, string, datetime, boolean, or category (booleans and integers were a bit tricky because of mapping and nullable types).

One of my favorite additions is pipeline export. The app reads the cleaning history and generates a downloadable Python script so the same cleaning steps can be reused on new datasets.

I also optimized performance using caching so expensive dataframe analysis doesn’t run on every interaction. Plus I reorganized the UI into tabs and added tooltips so even non-technical users can understand what each option does.

Test files are available on my GitHub if anyone wants to try it. Please test each feature and give me feedback :)
Thanks a lot!!!

aneezakiran07

Debugging and debugging :”)
Hi!!
In this devlog, i spent hours doing a full code review and fixing everything i found that was wrong!
In both smart_column_cleaner and the Recommendations tab conversion paths, converted series with a sparse dropna index were being assigned back to the full dataframe. This simply means it turns the originally null rows into NaN. The fix was applying .reindex(df_clean.index) in all conversion branches. Similar issues existed in validate_email, cap_outliers, and validate_range, where row removals left a non contiguous index and caused mismatches in before/after comparisons. These were fixed by adding .reset_index(drop=True).
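the reindex/reset_index pattern in a tiny example (the column name is made up):

```python
import pandas as pd

df_clean = pd.DataFrame({"price": ["$5", None, "$7.50"]})

# convert only the non-null values, then realign to the full index so
# the converted series lines up with the original row positions
converted = (
    df_clean["price"].dropna().str.replace(r"[$,]", "", regex=True).astype(float)
)
df_clean["price"] = converted.reindex(df_clean.index)

# after row removals (e.g. dropping invalid values), make the index
# contiguous again so before/after comparisons don't mismatch
df_clean = df_clean.dropna(subset=["price"]).reset_index(drop=True)
```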

moreover, KNNImputer could receive n_neighbors=0 on very small datasets, so i added a max(1, min(…, len(df_clean))) guard. IterativeImputer also crashes when only one numeric column exists, so it now automatically falls back to KNNImputer.

the most illegal one :”) history was being pushed even when operations failed, so it now uses snapshot() and commit_history() to record only successful operations.
find_and_replace previously cast entire columns to strings, converting NaN to “nan”. validate_date_col also used an incorrect missing value type. Also, all pandas inplace=True calls were replaced with reassignment for pandas 3.x compatibility, and Excel downloads are now cached to avoid unnecessary regeneration.

Note: I attached the video explaining each feature, tho the video is rushed cuz ft don't allow uploading large videos TT

0
aneezakiran07

tooltips, readme update, and testing
in this devlog, i added help tooltips across the Clean and Validate tabs. Every widget that needed context now has a ? icon; on hover it explains what the option does and when to use it. I made it so that even non-tech users can understand what's happening with their data!
Also updated the README and included descriptions of the new features in it, e.g. the tab layout, column profiler, before/after comparison, history and undo, pipeline export, and multi-sheet Excel support. Also added one more multi-sheet excel test file.
Moreover, i performed extensive testing and made sure that my pipeline works on both single-sheet and multi-sheet excel files.
If you guys have any issues with it then feedback is welcome!!!

0
aneezakiran07

Before/after comparison, column type override, and pipeline export
Hi!!!
in this devlog, three more features are added.
First was before/after column comparison, you pick a column and it shows original vs current values side by side with a third column that flags exactly which rows changed.
Second was column type override: a dropdown to force any column to int, float, string, datetime, boolean, or category.
it was kinda tricky TT cuz booleans needed a string-mapping step first, integers needed nullable Int64 so NaN values don't cause a cast error, and datetime needed errors='coerce'.
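a sketch of those conversions (the function name and string mapping are mine, not the app's exact code):

```python
import pandas as pd


def override_column_type(df: pd.DataFrame, col: str, target: str) -> pd.DataFrame:
    df = df.copy()
    if target == "int":
        # nullable Int64 so NaN values don't cause a cast error
        df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")
    elif target == "boolean":
        # map common string spellings first, then use the nullable boolean dtype
        mapping = {"true": True, "yes": True, "1": True,
                   "false": False, "no": False, "0": False}
        df[col] = (
            df[col].astype(str).str.strip().str.lower().map(mapping).astype("boolean")
        )
    elif target == "datetime":
        # unparseable values become NaT instead of raising
        df[col] = pd.to_datetime(df[col], errors="coerce")
    return df
```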
Third was pipeline export, which I think is the most useful thing. It reads through the cleaning history and writes out a proper .py script you can download and rerun on any new file.

0
aneezakiran07

UI CHANGES
HI
This devlog is mostly a styling and restructuring UI rather than new features.
The first thing was removing the big heading at the top of the page, because it was taking a lot of space and wasn't necessary in the first place.
Then i made the tabs sit right at the top of the page like a website navbar. Streamlit's default tab styling is pretty small, so i had to override it with CSS.
Before, my UI was one long, long webpage, which made for a bad user experience.
Now i made separate tabs for separate purposes, so just by looking at the UI, users know where they will find which features!!!
Streamlit is easy (if we use its own styling) but if we try to write our own css in it :”) then it sucks :)

0
aneezakiran07

Implemented Column Profiler, find & replace and undo/history system
Hi!!
In this devlog, I added three features to the data cleaning pipeline that I felt were genuinely missing from the first version.
First was a column profiler, a table that shows you per-column stats like min, max, mean, median, std, skewness, null percentage, and sample values. Cached it with @st.cache_data so it only recomputes when the dataframe actually changes.
Second was find & replace: pick a column, type a search string, type a replacement, and optionally flip on regex mode for pattern-based replacements.
Third was a full undo/history system. The history panel shows every step and lets you undo one step at a time or wipe the whole history.

0
aneezakiran07

Resolved Performance Issues
HI!!
in this devlog, I optimized the app to stop re-running expensive operations on every single user interaction. The main issue was that analyze_data_issues, which runs regex loops across every column in the dataframe, was firing on every button click. I fixed this by wrapping it (along with the recommendations generator) into a single @st.cache_data function called get_analysis_and_recommendations. I did the same for the file reader and the stats calculator. Also deleted all verbose=True calls inside manual operations as they were triggering partial rerenders.
Some users said it's a bit slow, that's why i spent 3 hours just figuring out what the issue could be, and came up with these solutions
Also, I made two modes, one simple and one advanced in the settings sidebar, by clicking to advanced mode, user can select the thresholds and model they want to implement for missing values!
in the next devlog, i will implement more functions slowly and will also make it more user friendly.

0
aneezakiran07

Shipped this project!

Hours: 7.94
Cookies: 🍪 150
Multiplier: 18.89 cookies/hr

HII!!!
So, I finally shipped this after working on it for a while and I’m happy with how it turned out (took me months :”)).

The app now has a full Validation & Quality section with five new operations that can validate emails (flag or drop invalid ones), standardize messy phone numbers into one clean format, properly parse mixed-format dates, detect and cap outliers using IQR or Z-score with configurable thresholds, and validate value ranges to catch impossible values like age = -5 or score = 999.

The hardest part was definitely the date standardization because pandas was silently failing on most non-ISO formats, so I rewrote it from scratch to try 17 explicit formats per cell, which boosted accuracy from almost useless to actually reliable.
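the try-explicit-formats-per-cell idea looks roughly like this — the app uses 17 formats, this list is just an illustrative subset:

```python
from datetime import datetime

import pandas as pd

# a few of the explicit formats tried per cell, in priority order
DATE_FORMATS = [
    "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d-%b-%Y",
    "%B %d, %Y", "%d.%m.%Y",
]


def parse_date(value):
    if pd.isna(value):
        return pd.NaT
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return pd.NaT  # honest failure instead of a silent wrong guess
```

because the format list is explicit and ordered, ambiguous strings like 05/01/2024 resolve deterministically (day-first here) instead of pandas silently guessing.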

I also made every operation follow the same clean interaction pattern, with popup dropdowns, apply-to-all, checkboxes, and action buttons disabled until valid input is given.
I built this because i love doing data cleaning tasks, so i figured, why not make a general tool for it :”)

Note: You can download test_data csv from my github to check this shipped project

aneezakiran07

HI!!!
In this devlog, I added a new Validation & Quality section to the data cleaning pipeline where I implemented five new validators to check emails, clean and standardize phone numbers, correctly parse mixed-format dates, detect and handle outliers, and validate value ranges. This update helped fix many hidden data issues, especially the date parser which now handles almost all common formats, and I also kept the UI consistent with the rest.

0
aneezakiran07

HII!!
So in this devlog,
I turned the app into an intelligent assistant that actually thinks about your data.
It now scans the dataset, finds problems like duplicates, currency symbols, wrong data types, percentages as text, and missing values, and shows them as simple fix cards.
Each issue comes with a one-click “Fix This” button, and I also added checkboxes in front of AI suggestions so users can select exactly which columns they want to apply fixes to.
Added an “Auto-Fix All Issues” button that runs the full pipeline in the best order and fixes everything at once.
Now beginners can clean data without technical knowledge, while advanced users still have full control.

0
aneezakiran07

HII!!
So in this devlog,
I upgraded the website with a super-smart missing value handler. It now detects all kinds of missing data like “NA”, “?”, -999, and more, then fills them intelligently.
Numeric columns get KNN or MICE imputation depending on dataset size, while categorical columns get mode or “Missing” automatically.
Smart threshold drops columns with too many missing values, and everything can be controlled in the sidebar.
Also added a one-click “Full Pipeline” button that runs all cleaning steps in the best order, with detailed feedback showing exactly what changed.

0
aneezakiran07

HII!!
So in this devlog,
I upgraded the app into a smart data transformation system by adding intelligent string cleaning and automatic type detection.
The system now cleans text, detects patterns like currency, percentages, units, durations, and numeric values, and converts them automatically.
smart threshold system prevents wrong conversions, and users can control sensitivity using a settings sidebar.
also added a one-click “Run Basic Pipeline” button and detailed feedback showing exactly what was converted.

0
aneezakiran07

HI!!!
SO in this devlog, I improved the data cleaning pipeline with a cleaner UI and a real-time statistics dashboard showing rows, columns, missing cells, duplicates, and data types.
I added a flexible preview slider (5–50 rows), a collapsible column info panel, export options (CSV & Excel), and a reset button to restore the original dataset instantly.

0
aneezakiran07

I’ve built a Streamlit UI that lets you upload a CSV and instantly clean it by dropping duplicate rows/columns and stripping extra spaces from text. Users choose which functions they want to run using the provided buttons. I already pushed the code for these three core functions and added a data preview so you can see the results immediately. For the next session, I’m moving on to removing special characters and fixing missing values.

Attachment
Attachment
0
aneezakiran07

I spent an hour setting up a Streamlit interface to handle the tedious parts of data cleaning. I wrote three core functions that take any uploaded CSV and automatically fix it: one to strip hidden whitespaces from text, one to drop duplicate rows, and a third to find and remove identical columns. The goal was to make something generic so I don’t have to manually clean files every time I start a new project. It’s simple, fast, and handles the “dirty” data work in one click.

Attachment
0
aneezakiran07

I’m working on my first project! This is so exciting. I can’t wait to share more updates as I build.

Attachment
0