
HamroAI

9 devlogs
27h 45m 11s

A SOTA Nepali LLM. Some of the time was spent on model training. PLEASE READ THE GITHUB README

This project uses AI

GitHub Copilot was used for debugging runtime and import errors. The wrapper was also built with AI.

Demo Repository


aayushydvnp

Shipped this project!

Fixed the model's output. On some inputs it was producing gibberish.

aayushydvnp

Push update:
There was a bug that was producing gibberish instead of meaningful output. Fixed it.

aayushydvnp

Shipped this project!

Hours: 40.09
Cookies: 🍪 882
Multiplier: 22.01 cookies/hr

I’m shipping the current HamroAI milestone with the model and dataset now published on Hugging Face, plus a cleaner Colab-ready inference flow.

The hardest part was not just training, but making the full pipeline reliable: dependency conflicts, runtime incompatibilities, tokenizer/model loading issues, and deployment consistency. I spent time debugging each stage, simplifying the workflow, and turning it into a repeatable process so future training runs are much smoother.

Please note: at this stage, the model still produces some gibberish outputs in parts. I plan to improve this as I get access to more compute.

Current limitations:
Free compute is not enough for the training scale I originally intended, so results are not yet as competent as I want.
I have basic ML/DL knowledge, but this project showed me how much deeper this gets in real-world training and deployment.
This does not mean I’m dropping the project. I’m taking time to strengthen my concepts, then return with stronger execution and better training quality.

I also want to apologize for the delay in major model updates. I have already streamlined most of the process, and the key remaining step is larger-scale training. I will push improved versions as soon as I have better compute access.

aayushydvnp

Update
Today I shipped a release for HamroAI.
Changelog:
1) I uploaded the latest LoRA model to Hugging Face, published the dataset there, and cleaned up the project documentation so people can actually use the model without guesswork.
2) I also added a Colab-ready inference notebook and linked it directly from the README with an Open in Colab button, so anyone can test the model in a few clicks.

Overall, this update moved HamroAI from an internal training setup to something shareable, reproducible, and much easier for others to try.

Model: https://huggingface.co/darksunnp/hamroai-nepali-lora-v1
Dataset: https://huggingface.co/datasets/darksunnp/hamroai
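For reference, the Colab inference flow roughly follows the standard transformers + PEFT loading pattern. This is a minimal sketch, not the project's exact notebook code: the base model ID is a placeholder (check the README for the one actually used), and `build_prompt` is an illustrative instruction template, not necessarily the training format.

```python
# Sketch of LoRA-adapter inference with transformers + PEFT.
# BASE_MODEL is a placeholder assumption -- see the repo README.
BASE_MODEL = "base-model-id-from-readme"      # placeholder, not confirmed
ADAPTER = "darksunnp/hamroai-nepali-lora-v1"  # published adapter from the post


def build_prompt(user_text: str) -> str:
    """Wrap user text in a simple instruction template (illustrative only)."""
    return f"### Instruction:\n{user_text}\n\n### Response:\n"


def generate(user_text: str, max_new_tokens: int = 128) -> str:
    # Imported lazily so the prompt helper can be used without the heavy deps.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    model = PeftModel.from_pretrained(base, ADAPTER)  # attach the LoRA adapter
    inputs = tokenizer(build_prompt(user_text), return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```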

aayushydvnp

“Every minute you spend fixing a symptom is a minute stolen from curing the disease.”
Fixed the tokenizer issue. Now it works! Hooray!
I also fixed a few package issues outright instead of applying patches, which had been greatly slowing down execution.

HOWEVER! The output is still not desirable: sometimes it just repeats the first token, or mismatches words with the wrong context. I will be working on that next.

Changelog
Fixed the tokenizer, which was applying its merges in the wrong order.
Fixed the imports (there were 6-7 patches for imports, so I removed those and consolidated everything into a single imports cell, reducing clutter).
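To see why merge order matters in BPE, here is a toy illustration (not the project's actual tokenizer): learned merges carry an implicit priority, so applying them out of order produces a different, wrong segmentation.

```python
# Toy BPE merge application: merges must be applied in the order they
# were learned, because later merges build on the output of earlier ones.

def apply_merges(word, merges):
    """Apply (a, b) -> ab merge rules to a word, in the order given."""
    symbols = list(word)
    for a, b in merges:  # the order of `merges` is the priority order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # fuse the pair in place
            else:
                i += 1
    return symbols

merges = [("a", "b"), ("ab", "c")]        # learned: "ab" first, then "abc"
print(apply_merges("abc", merges))        # ['abc']      -- correct order
print(apply_merges("abc", merges[::-1]))  # ['ab', 'c']  -- wrong order
```

With the learned order, `"abc"` collapses to a single token; reversed, the `("ab", "c")` rule fires before any `"ab"` symbol exists, so the word stays split.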

In the screenshot: You will see the fixed BPE tokenizer implemented and in kaggle runtime

aayushydvnp

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”
— Abraham Lincoln

I realized this halfway into my project. Even though I was running the code blocks, and yes, they worked, I wasn't understanding what they were doing at a fundamental level. How does QLoRA even fine-tune an LLM? What should I change to make it better? What are the fundamental reasons for the errors? I wasn't grasping the essence of the project. While everyone uses AI for debugging, we often forget that we must understand what we are doing and not offload our thinking to AI.

So I took some time out to learn the fundamentals of ML and NLP themselves, and I will continue to do so, because knowing what you are doing feels way better than "okay, this works".

aayushydvnp

Trained the model from checkpoints AND GOT THE FIRST OUTPUT FROM MY MODEL. But it was nothing meaningful, so I had to tweak the parameters and switch to my own tokenizer, and I am still training to see what happens. Stay updated, guys.
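Training from checkpoints boils down to one pattern: persist the step counter and state periodically, and on startup resume from the last saved state instead of step zero. A toy sketch of that pattern (generic, not the actual training code, with a fake `step_fn` standing in for one optimization step):

```python
# Generic resume-from-checkpoint loop: if a checkpoint file exists,
# continue from its saved step; otherwise start fresh. State is saved
# every 2 steps so a killed session loses little progress.
import json
import os
import tempfile

def train(total_steps, ckpt_path, step_fn):
    state = {"step": 0, "loss": None}
    if os.path.exists(ckpt_path):          # resume if a checkpoint exists
        with open(ckpt_path) as f:
            state = json.load(f)
    for step in range(state["step"], total_steps):
        state["loss"] = step_fn(step)      # one (fake) optimization step
        state["step"] = step + 1
        if state["step"] % 2 == 0:         # checkpoint periodically
            with open(ckpt_path, "w") as f:
                json.dump(state, f)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train(4, ckpt, step_fn=lambda s: 1.0 / (s + 1))  # "session" ends at step 4
resumed = train(8, ckpt, step_fn=lambda s: 1.0 / (s + 1))
print(resumed["step"])  # 8 -- continued from step 4, not from scratch
```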

aayushydvnp

Training and tokenizing data on Kaggle. (Connected a remote Kaggle session to VS Code for time tracking.)

There were some problems, for example:

  1. The whole deduplicated dataset was loaded at once, which made the GPU run out of memory and throw a CUDA error, so I am planning to train on partitions instead.
  2. Kaggle sessions have a 12-hour limit, and when a session ends all data is wiped, so I had to figure out how to save progress temporarily and build the training around checkpoints rather than doing it all at once (used Hugging Face and Kaggle persistence).
  3. I also had exams during this time.
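The partitioning idea from point 1 can be sketched generically: instead of materializing the whole corpus, yield it in fixed-size shards so each pass only holds one partition in memory. (This is the general pattern, not the exact project code.)

```python
# Yield the corpus in fixed-size partitions instead of loading it all,
# so each training pass only holds one shard in memory at a time.

def iter_partitions(examples, partition_size):
    """Yield successive slices of `examples` of at most `partition_size`."""
    for start in range(0, len(examples), partition_size):
        yield examples[start:start + partition_size]

corpus = [f"doc-{i}" for i in range(10)]
shards = list(iter_partitions(corpus, 4))
print([len(s) for s in shards])  # [4, 4, 2]
```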
aayushydvnp

Trained Tokenizer and this is the result:
Vocab size: 32000

Measuring fertility on 10000 documents…
Total words: 2,414,517
Total tokens: 2,919,045
Fertility: 1.209 tokens/word
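For anyone curious how the fertility number above is computed: it is the average number of subword tokens per whitespace word, so lower means the tokenizer splits words less. A self-contained sketch, where `toy_tokenize` stands in for the real trained tokenizer:

```python
# Fertility = total subword tokens / total whitespace words (lower is better).
# `toy_tokenize` is a stand-in: it splits words longer than 4 chars in two.

def toy_tokenize(word):
    return [word] if len(word) <= 4 else [word[:4], word[4:]]

def fertility(documents):
    words = [w for doc in documents for w in doc.split()]
    tokens = [t for w in words for t in toy_tokenize(w)]
    return len(tokens) / len(words)

docs = ["short text here", "tokenization splits longer words"]
print(round(fertility(docs), 3))  # 1.714 -- 12 tokens over 7 words
```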

NOW THE CRAZY PART:
my tokenizer uses just 5 tokens where Mistral-7B or GPT-2 uses 45 tokens, i.e. 9× fewer.
It even beats GPT-5's tokenizer by 1.5-2% on NEPALI TEXT (HOLY SMOKES, YAY!)

aayushydvnp

Collected a raw text corpus and voice data for the Nepali language totalling about 30 GB, then cleaned the text data, bringing the cleaned dataset down to 6 GB.
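Going from 30 GB raw to 6 GB clean typically means whitespace normalization, dropping near-empty lines, and deduplicating exact repeats. An illustrative cleaning pass under those assumptions (the real pipeline lives in the repo):

```python
# Illustrative corpus-cleaning pass: normalize whitespace, drop very
# short lines, and remove exact duplicates -- typical steps for shrinking
# a raw crawl before tokenizer training.

def clean_corpus(lines, min_chars=10):
    seen = set()
    cleaned = []
    for line in lines:
        line = " ".join(line.split())  # collapse runs of whitespace
        if len(line) < min_chars:      # drop near-empty lines
            continue
        if line in seen:               # exact-duplicate removal
            continue
        seen.add(line)
        cleaned.append(line)
    return cleaned

raw = [
    "  नेपाली   भाषा  को  वाक्य  यहाँ छ  ",  # messy whitespace
    "ok",                                     # too short
    "नेपाली भाषा को वाक्य यहाँ छ",            # duplicate after normalization
    "",                                       # empty
]
print(clean_corpus(raw))  # one surviving, normalized line
```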
