A SOTA Nepali LLM. Some of the logged time went to model training. PLEASE READ THE GITHUB README
GitHub Copilot was used for debugging runtime and import errors. The wrapper was also built with AI.
Fixed the model's output. On some inputs it was producing gibberish.
Push update:
There was an output bug causing gibberish instead of coherent text. Fixed it.
I’m shipping the current HamroAI milestone with the model and dataset now published on Hugging Face, plus a cleaner Colab-ready inference flow.
The hardest part was not just training, but making the full pipeline reliable: dependency conflicts, runtime incompatibilities, tokenizer/model loading issues, and deployment consistency. I spent time debugging each stage, simplifying the workflow, and turning it into a repeatable process so future training runs are much smoother.
Please note: at this stage, the model still produces some gibberish outputs in parts. I plan to improve this as I get access to more compute.
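If you want to poke at it yourself, the inference flow boils down to something like this (a minimal sketch, not the notebook verbatim: the generation settings are illustrative, and I'm assuming the adapter repo ships its own tokenizer, otherwise load it from the base model):

```python
# Minimal sketch of the Colab inference flow. AutoPeftModelForCausalLM reads
# the base model id from the adapter config, so it isn't hard-coded here.
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

ADAPTER = "darksunnp/hamroai-nepali-lora-v1"

model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)  # assumes tokenizer is in the adapter repo

prompt = "नेपालको राजधानी"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    repetition_penalty=1.2,  # illustrative setting; can help with repeated-token output
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```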
Current limitations:
Free compute is not enough for the training scale I originally intended, so results are not yet as strong as I want.
I have basic ML/DL knowledge, but this project showed me how much deeper this gets in real-world training and deployment.
This does not mean I’m dropping the project. I’m taking time to strengthen my concepts, then return with stronger execution and better training quality.
I also want to apologize for the delay in major model updates. I have already streamlined most of the process, and the key remaining step is larger-scale training. I will push improved versions as soon as I have better compute access.
Update
Today I shipped a release for HamroAI.
Changelog:
1) I uploaded the latest LoRA model to Hugging Face, published the dataset there, and cleaned up the project documentation so people can actually use the model without guesswork.
2) I also added a Colab-ready inference notebook and linked it directly from the README with an Open in Colab button, so anyone can test the model in a few clicks.
Overall, this update moved HamroAI from an internal training setup to something shareable, reproducible, and much easier for others to try.
Model: https://huggingface.co/darksunnp/hamroai-nepali-lora-v1
Dataset: https://huggingface.co/datasets/darksunnp/hamroai
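If you want to pull the dataset programmatically, it's one call (a sketch; split names are whatever the Hugging Face repo actually exposes, I'm assuming a standard `train` split):

```python
# Load the published dataset straight from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("darksunnp/hamroai")
print(ds)              # show available splits
print(ds["train"][0])  # peek at the first record (assuming a "train" split)
```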
“Every minute you spend fixing a symptom is a minute stolen from curing the disease.”
Fixed the tokenizer issue.
Now it works! Hooray!
I also fixed a few package issues outright instead of stacking patches; those patches had been greatly slowing down execution.
HOWEVER! The output is still not something desirable: sometimes the model just repeats the first token, or pairs words with the wrong context. So I will be working on that from now on.
Changelog
Fixed the tokenizer: it was applying BPE merges in the wrong order.
Fixed the imports (there were about 6-7 import patches, so I removed them and consolidated everything into a single imports cell, reducing clutter).
In the screenshot: you can see the fixed BPE tokenizer running in the Kaggle runtime.
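To make the merge-order bug concrete: a BPE tokenizer stores its merges as an ordered list learned during training, and encoding must replay them in exactly that priority order, or the splits come out wrong. A toy sketch with the `tokenizers` library (illustrative only, not the actual project code):

```python
# Train a tiny BPE tokenizer; the learned merges form an ordered list that
# encoding replays in priority order. Applying them out of order yields
# different (wrong) token splits.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["नमस्ते संसार", "नमस्ते नेपाल"] * 50, trainer)

print(tokenizer.encode("नमस्ते नेपाल").tokens)
```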
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”
— Abraham Lincoln
I realized this halfway into my project. Even though I was running code blocks, and yes, they were working, I wasn't understanding what they were doing at a fundamental level. How does QLoRA even fine-tune an LLM? What should I change to make it better? What are the fundamental reasons for the errors? I wasn't grasping the essence of the project. While everyone uses AI for debugging, we often forget that we must understand what we are doing and not offload our thinking to AI.
So I took some time out to learn the fundamentals of ML and NLP themselves, and I will continue to do so, because knowing what you are doing feels way better than "okay, this works".
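For anyone in the same boat, here's the mental model I was missing, as a config sketch (illustrative; the base model name and hyperparameters here are assumptions, not my actual run). QLoRA freezes the base model in 4-bit and trains only small low-rank adapter matrices on top:

```python
# QLoRA in one picture: 4-bit quantized frozen base + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still happens in bf16
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # assumed base model, for illustration
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # only these get trainable low-rank deltas
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()          # typically well under 1% of total params
```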
Trained the model from checkpoints AND GOT THE FIRST OUTPUT FROM MY MODEL. But it was nothing meaningful, so I had to tweak the parameters and switch to my own tokenizer, and I'm still training to see what happens. Stay updated, guys.
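For context, resuming from checkpoints with the Hugging Face Trainer looks roughly like this (a sketch, not my exact script; `model` and `train_ds` are placeholders for the real model and tokenized dataset, and `hamroai-ckpts` is a hypothetical output directory):

```python
# Resume training from the most recent saved checkpoint.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="hamroai-ckpts",  # hypothetical checkpoint directory
    save_steps=500,              # checkpoint every 500 steps
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)  # placeholders
trainer.train(resume_from_checkpoint=True)  # loads the newest checkpoint in output_dir
```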
Training and tokenizing data on Kaggle. (Connected a remote Kaggle session to VS Code for time tracking.)
There were some problems along the way.
Trained the tokenizer, and this is the result:
Vocab size: 32000
Measuring fertility on 10,000 documents…
Total words: 2,414,517
Total tokens: 2,919,045
Fertility: 1.209 tokens/word
NOW THE CRAZY PARTTT:
My tokenizer uses just 5 tokens where Mistral-7B or GPT-2 uses 45 tokens on the same Nepali text, which is 9x better.
It even beats GPT-5's tokenizer by 1.5-2% on NEPALI TEXT (HOLY SMOKESSS YAYYYYY).
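For anyone curious how the fertility number above is computed: it's just total tokens emitted divided by total whitespace-separated words, so lower means the tokenizer is more efficient for the language. A minimal sketch (the stand-in corpus is mine, and I'm assuming the tokenizer is available from the adapter repo):

```python
# Fertility = total tokens / total words over a corpus sample.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("darksunnp/hamroai-nepali-lora-v1")  # assumed to ship the tokenizer

docs = ["नेपाल एक सुन्दर देश हो।", "काठमाडौं नेपालको राजधानी हो।"]  # stand-in corpus
words = sum(len(d.split()) for d in docs)
tokens = sum(len(tok.encode(d, add_special_tokens=False)) for d in docs)
print(f"Fertility: {tokens / words:.3f} tokens/word")  # lower = more efficient
```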
Collected a raw corpus of Nepali text and voice data totaling about 30 GB, and cleaned the text data, bringing the cleaned dataset down to 6 GB.
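The cleaning pass, in rough shape (illustrative only; the real pipeline involved more than this):

```python
# Keep Devanagari plus basic punctuation, collapse whitespace, drop fragments.
import re

NON_NEPALI = re.compile(r"[^\u0900-\u097F0-9\s.,!?।()\-]")
WS = re.compile(r"\s+")

def clean_line(line: str) -> str | None:
    line = NON_NEPALI.sub(" ", line)  # strip non-Devanagari noise (HTML tags, Latin text, etc.)
    line = WS.sub(" ", line).strip()  # normalize whitespace
    return line if len(line.split()) >= 3 else None  # drop near-empty fragments

print(clean_line("काठमाडौं <b>नेपालको</b> राजधानी हो।"))
```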