
HamroAI

4 devlogs
16h 30m 16s

A SOTA Nepali LLM

aayushydvnp

Training and tokenizing data on Kaggle. (Connected a remote Kaggle session to VS Code for time tracking.)

There were some problems, for example:

  1. The whole deduped dataset was loaded at once, which ran the GPU out of memory and threw a CUDA error, so I am planning to train on partitions instead.
  2. Kaggle sessions are limited to 12 hours, and after a session ends all data is wiped, so I had to figure out how to save progress temporarily and build the training around checkpoints rather than doing it in one run (used Hugging Face and Kaggle persistence).
  3. I also had exams during this time.
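The checkpoint-and-resume pattern from point 2 can be sketched in plain Python. This is a hypothetical, stdlib-only illustration, not the actual training code: `train_step` stands in for whatever runs on one data partition, and the checkpoint file is whatever survives the session (Kaggle persistence or a push to the Hugging Face Hub).

```python
import json
from pathlib import Path

# Hypothetical sketch of resume-from-checkpoint training: process the
# dataset in partitions and record progress after each one, so a fresh
# 12-hour Kaggle session can pick up where the last one stopped.
def train_in_partitions(partitions, ckpt_path, train_step):
    ckpt = Path(ckpt_path)
    state = json.loads(ckpt.read_text()) if ckpt.exists() else {"done": 0}
    for i, part in enumerate(partitions):
        if i < state["done"]:
            continue  # already handled in an earlier session
        train_step(part)  # train on just this slice of the data
        state["done"] = i + 1
        # persist progress somewhere that outlives the session
        ckpt.write_text(json.dumps(state))
    return state["done"]
```

On a restart with the same checkpoint file, already-finished partitions are skipped, so the run continues instead of starting over.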
aayushydvnp

Trained the tokenizer and this is the result:
Vocab size: 32000

Measuring fertility on 10000 documents…
Total words: 2,414,517
Total tokens: 2,919,045
Fertility: 1.209 tokens/word
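Fertility is just total tokens divided by total whitespace-separated words, averaged over the corpus. A minimal sketch of the metric, where `tokenize` stands in for the trained tokenizer's encode function:

```python
# Sketch of the fertility metric: average number of tokens the tokenizer
# emits per whitespace-separated word. Lower is better; 1.0 would mean
# one token per word.
def measure_fertility(docs, tokenize):
    total_words = sum(len(doc.split()) for doc in docs)
    total_tokens = sum(len(tokenize(doc)) for doc in docs)
    return total_words, total_tokens, total_tokens / total_words
```

With the numbers above, 2,919,045 / 2,414,517 gives the reported 1.209 tokens per word.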

NOW THE CRAZY PARTTT:
my tokenizer uses just 5 tokens where Mistral-7B or GPT-2 uses 45 tokens on the same Nepali text, which is 9x better
it even beats GPT-5's tokenizer by 1.5-2% on NEPALI TEXT ( HOLY SMOKESSS YAYYYYY )
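A likely reason for such a large gap (my assumption, not measured from those tokenizers here): byte-level BPE tokenizers trained mostly on English have few learned merges for Devanagari, so they decompose Nepali text close to the UTF-8 byte level, and every Devanagari character is 3 bytes in UTF-8. A quick stdlib demonstration of the byte blow-up:

```python
# Every Devanagari codepoint (U+0900-U+097F) takes 3 bytes in UTF-8,
# so a tokenizer falling back toward bytes pays ~3 tokens per character,
# while a Nepali-trained vocab can keep whole words as single tokens.
text = "नमस्ते संसार"  # "hello world" in Nepali
print(len(text), "characters")
print(len(text.encode("utf-8")), "UTF-8 bytes")
```

A tokenizer whose vocabulary was learned on Nepali text never has to pay that per-byte cost.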

aayushydvnp

Collected a raw corpus and voice data of the Nepali language totalling about 30 GB, and cleaned the text data, bringing the cleaned dataset down to 6 GB.
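The kind of cleaning pass that shrinks a raw crawl like that can be sketched as follows. This is an illustrative, stdlib-only example with made-up heuristics and thresholds, not the actual pipeline: strip markup, drop lines that are mostly non-Devanagari, and deduplicate exact lines by hash.

```python
import re
import hashlib

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
TAG = re.compile(r"<[^>]+>")

# Hypothetical cleaning pass: remove HTML tags, keep only lines that are
# at least half Devanagari characters, and skip exact duplicate lines.
# The 0.5 ratio is an illustrative threshold, not the project's actual one.
def clean_corpus(lines, min_devanagari_ratio=0.5):
    seen = set()
    for line in lines:
        line = TAG.sub(" ", line).strip()
        if not line:
            continue
        ratio = len(DEVANAGARI.findall(line)) / len(line)
        if ratio < min_devanagari_ratio:
            continue  # mostly markup, English, or numbers
        h = hashlib.md5(line.encode("utf-8")).hexdigest()
        if h in seen:
            continue  # exact duplicate line
        seen.add(h)
        yield line
```

Markup, non-Nepali boilerplate, and duplicated pages account for a lot of raw crawl volume, which is consistent with a 30 GB corpus shrinking to 6 GB after cleaning.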
