A SOTA Nepali LLM
Training the tokenizer and processing the data on Kaggle (I connected a remote Kaggle session to VS Code for time tracking).
There were a few problems along the way.
Trained the tokenizer, and this is the result:
Vocab size: 32000
Measuring fertility on 10000 documents…
Total words: 2,414,517
Total tokens: 2,919,045
Fertility: 1.209 tokens/word
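Fertility here is just total tokens divided by total whitespace-separated words: 2,919,045 / 2,414,517 ≈ 1.209. A minimal sketch of the measurement, assuming a `tokenize` callable that maps a string to a list of tokens (the actual tokenizer isn't shown in this post):

```python
def measure_fertility(documents, tokenize):
    """Fertility = total tokens produced / total whitespace-separated words.

    `tokenize` is any callable mapping a string to a list of tokens;
    lower fertility means the tokenizer splits words into fewer pieces.
    """
    total_words = sum(len(doc.split()) for doc in documents)
    total_tokens = sum(len(tokenize(doc)) for doc in documents)
    return total_tokens / total_words

# Toy example with a hypothetical character-level tokenizer:
docs = ["नमस्ते संसार", "यो एउटा उदाहरण हो"]
char_level = lambda text: [c for c in text if not c.isspace()]
print(f"Fertility: {measure_fertility(docs, char_level):.3f} tokens/word")
```

A fertility near 1.0 means most Nepali words stay intact as single tokens instead of being shattered into subword pieces.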
NOW THE CRAZY PART:
On a sample Nepali sentence, my tokenizer uses just 5 tokens where Mistral-7B or GPT-2 uses 45, which is 9x fewer.
It even beats GPT-5's tokenizer by 1.5-2% on Nepali text (HOLY SMOKES, YAY!).
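A comparison like the one above can be scripted by counting tokens for the same text under each tokenizer. The sketch below uses stand-in callables, since the post doesn't include the actual comparison code; the names and the byte-level stand-in (GPT-2-style BPE largely falls back to bytes on Devanagari, which inflates its counts) are assumptions:

```python
def compare_tokenizers(text, tokenizers):
    """Count tokens per tokenizer and report the best-to-worst ratio."""
    counts = {name: len(tok(text)) for name, tok in tokenizers.items()}
    best = min(counts, key=counts.get)
    worst = max(counts, key=counts.get)
    ratio = counts[worst] / counts[best]
    return counts, f"{best} uses {ratio:.1f}x fewer tokens than {worst}"

# Stand-ins: a word-level tokenizer vs a byte-level fallback.
sample = "नेपाली भाषा"
word_level = lambda t: t.split()
byte_level = lambda t: list(t.encode("utf-8"))
counts, summary = compare_tokenizers(
    sample, {"mine": word_level, "byte-bpe": byte_level}
)
print(counts)
print(summary)
```

Swapping the stand-ins for real tokenizers (e.g. via the Hugging Face `tokenizers` library) gives the apples-to-apples numbers quoted above.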
Collected a raw corpus and voice data of the Nepali language totaling about 30 GB, and cleaned the text data, bringing the cleaned data size down to 6 GB.
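The post doesn't show the cleaning pipeline, so the pass below is a sketch under assumptions (the exact filters that took 30 GB down to 6 GB are not stated): normalize whitespace, keep lines that are mostly Devanagari, and drop exact duplicates.

```python
import hashlib
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")  # Devanagari Unicode block

def clean_corpus(lines, min_devanagari_ratio=0.5, min_chars=10):
    """Yield normalized, deduplicated lines that are mostly Devanagari."""
    seen = set()
    for line in lines:
        line = re.sub(r"\s+", " ", line).strip()  # collapse whitespace
        if len(line) < min_chars:
            continue  # too short to be useful training text
        chars = [c for c in line if not c.isspace()]
        ratio = sum(bool(DEVANAGARI.match(c)) for c in chars) / len(chars)
        if ratio < min_devanagari_ratio:
            continue  # mostly non-Nepali (markup, English boilerplate)
        digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate line
        seen.add(digest)
        yield line

raw = ["नेपाल   एक  सुन्दर देश हो।", "नेपाल एक सुन्दर देश हो।", "click here!!"]
print(list(clean_corpus(raw)))
```

Near-duplicate detection (e.g. MinHash) and language ID would cut further, but even this exact-dedup pass typically removes a large share of scraped web text.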