Shipped this project!
Fixed the model's output. On some inputs it was producing gibberish.
I built a cat that judges your code (brutally) using the Hack Club AI API. It will make you a better comment writer (yeah, your current ones are bad).
Update:
THE WEBSITE IS LIVE AND WORKINGGGG
Changelog:
Completed the website and deployed it on Vercel
Fixed bugs during deployment.
Changed UI
Next Steps:
NOTHING, IT'S LIVE
Update:
Worked on website design and favicon design (FIXING BUGGGSSSS OMGG)
Changelog:
Working on website and UI
Fixed UI bugs
Fixed API bugs
Applied a two-pass system (evaluator + judge)
Next Steps:
Complete the website and deploy it to the web.
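The two-pass system mentioned in the changelog can be sketched as a small orchestration function. This is a minimal sketch, not the site's actual code: `call_model` stands in for whatever function sends a prompt to the Hack Club AI endpoint and returns the reply, and the prompts are illustrative.

```python
# Minimal sketch of a two-pass review flow: a first "evaluator" pass
# critiques the code, and a second "judge" pass turns that critique into
# the final verdict shown to the user. `call_model` is any function that
# sends a prompt string to an LLM endpoint and returns its reply.

def two_pass_review(code: str, call_model) -> dict:
    # Pass 1: the evaluator lists concrete problems in the code/comments.
    critique = call_model(
        "You are a strict code reviewer. List the problems with the "
        "comments and style in this code:\n\n" + code
    )
    # Pass 2: the judge sees only the critique and produces the final,
    # brutally honest verdict.
    verdict = call_model(
        "You are a brutally honest cat judge. Based on this critique, "
        "give a short final verdict:\n\n" + critique
    )
    return {"critique": critique, "verdict": verdict}
```

Splitting evaluation from judgment like this tends to give more consistent verdicts than asking one prompt to do both jobs at once.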
Update:
Worked on website design and favicon design
Changelog:
Working on website and UI
Make sure the JSON output works.
Next Steps:
Complete the website and run tests to ensure that the commenter works.
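Making sure the JSON output works usually means guarding against the model wrapping its answer in markdown fences or surrounding chatter. A minimal sketch of that kind of guard (the function name and regexes are my own illustration, not the project's code):

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Pull a JSON object out of an LLM reply, tolerating ```json fences
    and surrounding chatter. Raises ValueError if nothing parses."""
    # Strip markdown code fences if the model wrapped its answer in them.
    fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", reply, re.DOTALL)
    if fenced:
        reply = fenced.group(1)
    # Fall back to the first {...} span in the text.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))
```

Raising instead of returning a default makes it easy to retry the API call when the model produces unparseable output.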
Update:
Worked on website design and favicon design
Changelog:
Update
Successfully connected to the Hack Club AI API.
Changelog:
Next Steps:
Update
I prepared the documentation and formulated a plan to build the project.
Changelog:
Overall, this marks the beginning of the project.
Push update:
There was an output error that was causing gibberish instead of proper output. Fixed it.
I’m shipping the current HamroAI milestone with the model and dataset now published on Hugging Face, plus a cleaner Colab-ready inference flow.
The hardest part was not just training, but making the full pipeline reliable: dependency conflicts, runtime incompatibilities, tokenizer/model loading issues, and deployment consistency. I spent time debugging each stage, simplifying the workflow, and turning it into a repeatable process so future training runs are much smoother.
Please note: at this stage, the model still produces some gibberish outputs in parts. I plan to improve this as I get access to more compute.
Current limitations:
Free compute is not enough for the training scale I originally intended, so results are not yet as competent as I want.
I have basic ML/DL knowledge, but this project showed me how much deeper this gets in real-world training and deployment.
This does not mean I’m dropping the project. I’m taking time to strengthen my concepts, then return with stronger execution and better training quality.
I also want to apologize for the delay in major model updates. I have already streamlined most of the process, and the key remaining step is larger-scale training. I will push improved versions as soon as I have better compute access.
Update
Today I shipped a release for HamroAI.
Changelog:
1) I uploaded the latest LoRA model to Hugging Face, published the dataset there, and cleaned up the project documentation so people can actually use the model without guesswork.
2) I also added a Colab-ready inference notebook and linked it directly from the README with an Open in Colab button, so anyone can test the model in a few clicks.
Overall, this update moved HamroAI from an internal training setup to something shareable, reproducible, and much easier for others to try.
Model: https://huggingface.co/darksunnp/hamroai-nepali-lora-v1
Dataset: https://huggingface.co/datasets/darksunnp/hamroai
“Every minute you spend fixing a symptom is a minute stolen from curing the disease.”
Fixed the tokenizer issue.
Now it works! hooray!
I also fixed a few package issues properly instead of applying patches, which had been greatly slowing down execution.
HOWEVER! The output is still not desirable, as it sometimes just repeats the first token and mismatches words across different contexts. So I'll be working on that from now on.
Changelog
Fixed the tokenizer, as it was applying merges in the wrong order
Fixed the imports (had like 6-7 patches for imports, so I removed those and implemented a single imports cell, reducing clutter)
In the screenshot: you will see the fixed BPE tokenizer implemented and running in the Kaggle runtime
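The merge-order bug in the changelog is a classic BPE pitfall: learned merges must be applied by their learned rank, and changing that order changes the tokenization. A toy sketch (my own illustration, not the project's tokenizer code):

```python
def apply_bpe(word, merges):
    """Apply learned BPE merges to a word. `merges` is an ordered list of
    symbol pairs; pairs learned earlier have higher priority (lower rank),
    and must win over later pairs when both are possible."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        # Find the adjacent pair with the best (lowest) learned rank.
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs, default=(float("inf"), -1))
        if best_rank == float("inf"):
            break  # no learnable merge left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```

With merges `[("a","b"), ("b","c")]` the word "abc" becomes `["ab", "c"]`, but with the order reversed it becomes `["a", "bc"]`, so a tokenizer that applies merges out of order produces different (and wrong) token IDs.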
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”
— Abraham Lincoln
I realized this halfway into my project. Even though I was running code blocks, and yes, it was working, I wasn't understanding what it was doing at a fundamental level. Like, how does QLoRA even fine-tune an LLM? What should I change to make it better? What are the fundamental reasons for the errors? I wasn't grasping the essence of the project. While everyone uses AI for debugging, we often forget that we must understand what we are doing and not offload our thinking to AI.
So I took some time out and went into learning the fundamentals of ML and NLP itself, and I will continue to do so, as knowing what you are doing feels way better than "okay, this works".
Trained the model from checkpoints AND GOT THE FIRST OUTPUT FROM MY MODEL. But it was nothing meaningful, so I had to tweak the parameters and use my own tokenizer, and I am still training to see what happens. Stay updated, guys.
Training and tokenizing data on Kaggle (connected a remote Kaggle session to VS Code for time tracking).
There were some problems, for example:
Trained Tokenizer and this is the result:
Vocab size: 32000
Measuring fertility on 10000 documents…
Total words: 2,414,517
Total tokens: 2,919,045
Fertility: 1.209 tokens/word
NOW THE CRAZY PARTTT:
My tokenizer uses just 5 tokens where Mistral-7B or GPT-2 uses 45 tokens, which is 9x better.
It even beats GPT-5's tokenizer by 1.5-2% on NEPALI TEXT (HOLY SMOKESSS YAYYYYY)
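For context, fertility is just total subword tokens divided by total whitespace words, so the reported figure can be recomputed directly from the counts above. A minimal sketch (the `tokenize` callable is a placeholder for any tokenizer's encode function, not the trained tokenizer itself):

```python
def fertility(docs, tokenize, split_words=str.split):
    """Tokenizer fertility: average number of subword tokens produced per
    whitespace-separated word. Lower is better (1.0 = one token per word)."""
    total_words = sum(len(split_words(d)) for d in docs)
    total_tokens = sum(len(tokenize(d)) for d in docs)
    return total_tokens / total_words

# Sanity check against the corpus-level numbers reported above:
# 2,919,045 tokens over 2,414,517 words.
print(round(2_919_045 / 2_414_517, 3))  # → 1.209
```

Low fertility matters because every extra token per word shrinks the effective context window and slows both training and generation on Nepali text.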
Collected a raw corpus and voice data of the Nepali language totalling about 30 GB, and cleaned the text data, bringing the cleaned data size to 6 GB.
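Shrinking a 30 GB raw scrape to 6 GB of clean text typically comes from normalization, filtering, and de-duplication. A sketch of one plausible cleaning pass; the thresholds and filters here are illustrative choices of mine, not the project's actual pipeline:

```python
import re
import unicodedata

# Devanagari Unicode block, which covers Nepali text.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def clean_corpus(lines):
    """One plausible cleaning pass for a raw Nepali text scrape:
    normalize Unicode, collapse whitespace, drop near-empty lines and
    lines that are mostly non-Devanagari noise, and remove exact
    duplicates. (The 10-char and 50% thresholds are illustrative.)"""
    seen = set()
    for line in lines:
        line = unicodedata.normalize("NFC", line)
        line = re.sub(r"\s+", " ", line).strip()
        if len(line) < 10:
            continue  # too short to be useful training text
        letters = [c for c in line if not c.isspace()]
        if len(DEVANAGARI.findall(line)) < 0.5 * len(letters):
            continue  # mostly markup, English, or other noise
        if line in seen:
            continue  # exact duplicate
        seen.add(line)
        yield line
```

Exact-match de-duplication alone often removes a large fraction of scraped web text, since boilerplate paragraphs repeat across pages.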
I built a platform where students who are interested in participating in IOAI can gather and compete.
Main features:
Hardest part:
The best part of this:
Implemented the Kaggle API. We can connect to the Kaggle API and see the competitions currently running. We can also see our past 5 submissions for each competition and what score we got.
Built core principles in plain Python to let students see the basic backbone behind all core implementations. This is so they know what is actually going on when using libraries and understand the core of execution rather than just blindly using them. It will also help with error debugging.
Solved previous years' questions to upload to the website and got a better score than the official solution. Uploaded the code to Kaggle too, to enable easy running.
Added resources from various websites and scattered media and arranged them in a chronological way for students to easily follow the syllabus and learn.
Worked on the website and uploaded links and resources to it. I also worked on a few solutions to the Kaggle problems to help guide learners through the contests.
NEXT WORK:
Made a better solution to the problem with an XGBoost algorithm. Got the highest score on the leaderboard.
Prepared a baseline solution for the first problem, with both an ipynb notebook and a Kaggle notebook that will be made public soon; this solution will also be uploaded to the GitHub page.