Activity

aayushydvnp

Training and tokenizing data on Kaggle. (I connected a remote Kaggle session to VS Code for time tracking.)

There were some problems, for example:

  1. The whole deduped dataset was loaded at once, which ran the GPU out of memory and gave me a CUDA error, so I am planning to train on partitions.
  2. Kaggle has a 12-hour session limit, and once the session ends all the data is wiped, so I had to figure out how to save it temporarily and build the training around checkpoints rather than doing it all at once. (I used Hugging Face and Kaggle persistence.)
  3. I also had exams during this time.
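
The partition-plus-checkpoint idea from points 1 and 2 can be sketched roughly like this. It is only a minimal sketch, not my actual training code: `train_in_partitions`, `train_step`, and the JSON progress file are hypothetical names, and a real run would also save the model weights (for example to Hugging Face or Kaggle persistence) next to the progress marker.

```python
import json
import os
import tempfile


def load_progress(path):
    """Return the index of the next partition to train on (0 if no checkpoint)."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_partition"]
    return 0


def save_progress(path, next_partition):
    """Write the progress marker to a temp file, then swap it in atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_partition": next_partition}, f)
    os.replace(tmp, path)


def train_in_partitions(partitions, train_step, progress_path):
    """Train on one partition at a time, resuming after the last finished one.

    `partitions` is a list of partition identifiers (e.g. file paths) and
    `train_step` is a hypothetical callback that loads and trains on one of them.
    Returns the index the run started from.
    """
    start = load_progress(progress_path)
    for i in range(start, len(partitions)):
        train_step(partitions[i])            # only this partition is handled now
        save_progress(progress_path, i + 1)  # a later session resumes from here
    return start


if __name__ == "__main__":
    progress_file = os.path.join(tempfile.mkdtemp(), "progress.json")
    train_in_partitions(["part-0", "part-1"], print, progress_file)
```

If the 12-hour session dies in the middle of a partition, the next session repeats only that partition instead of starting over.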
aayushydvnp

Trained the tokenizer, and this is the result:
Vocab size: 32000

Measuring fertility on 10000 documents…
Total words: 2,414,517
Total tokens: 2,919,045
Fertility: 1.209 tokens/word
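
For reference, fertility here is just total tokens divided by total words. A minimal sketch of that calculation, assuming words are counted by splitting on whitespace and `tokenize` is any function that returns a list of tokens (both are assumptions, not necessarily how the numbers above were produced):

```python
def fertility(documents, tokenize):
    """Average number of tokens per whitespace-separated word."""
    total_words = 0
    total_tokens = 0
    for doc in documents:
        total_words += len(doc.split())
        total_tokens += len(tokenize(doc))
    return total_tokens / total_words


# Checking the reported numbers: 2,919,045 tokens over 2,414,517 words.
print(round(2_919_045 / 2_414_517, 3))  # 1.209
```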

NOW THE CRAZY PART:
My tokenizer uses just 5 tokens where Mistral-7B or GPT-2 uses 45, which is 9x fewer tokens.
It even beats GPT-5 by 1.5-2% on Nepali text (holy smokes, yay!)

aayushydvnp

Collected a raw corpus and voice data for the Nepali language totalling about 30 GB, then cleaned the text data, bringing it down to 6 GB.

aayushydvnp

Shipped this project!

Hours: 15.78
Cookies: 🍪 248
Multiplier: 15.69 cookies/hr

I built a platform where students who are interested in participating in IOAI can gather and compete.
Main features:

  1. Curated resources
  2. Kaggle competitions and submissions
  3. Submissions with top scores (I will add these as I solve the questions)

Hardest part:

  1. API integration (Kaggle is kinda weird and its API integration is hard)
  2. Figuring out what to hide and what to keep when uploading it to GitHub.

The best part of this:

  1. I was learning alongside building this website, so I also got to learn a lot of ML and improved my problem solving.
aayushydvnp

Implemented the Kaggle API integration. We can connect to the Kaggle API and see the competitions currently running. We can also see our past 5 submissions for each competition and the score each one got.
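
A rough sketch of the "past 5 submissions" part, not the site's actual code. `latest_submissions` and `fetch_rows` are hypothetical helper names. `fetch_rows` assumes the official `kaggle` Python package with credentials configured, and the submission attribute names (`date`, `public_score` / `publicScore`) are an assumption that may differ between package versions.

```python
def latest_submissions(rows, limit=5):
    """Return the `limit` most recent submissions, newest first.

    Each row is a dict with a sortable "date" string and a "score".
    """
    return sorted(rows, key=lambda r: r["date"], reverse=True)[:limit]


def fetch_rows(competition):
    """Fetch submissions for one competition (needs Kaggle credentials)."""
    # Imported lazily because the kaggle package looks for credentials on import.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    rows = []
    for s in api.competition_submissions(competition):
        # Attribute names are an assumption; they vary across kaggle versions.
        score = getattr(s, "public_score", None)
        if score is None:
            score = getattr(s, "publicScore", None)
        rows.append({"date": str(s.date), "score": score})
    return rows


if __name__ == "__main__":
    sample = [
        {"date": "2025-01-01 10:00:00", "score": "0.71"},
        {"date": "2025-01-03 09:30:00", "score": "0.78"},
        {"date": "2025-01-02 18:45:00", "score": "0.75"},
    ]
    for row in latest_submissions(sample, limit=2):
        print(row["date"], row["score"])
```

Keeping the sorting in its own small function makes it possible to test the display logic without calling Kaggle at all.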

aayushydvnp

Built the core principles from plain Python so students can see the basic backbone of every core implementation. This way they know what is actually going on when they use libraries and understand the core of the execution rather than just using libraries blindly. This will also help with debugging errors.
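
To give a flavour of what "from plain Python" means, here is one possible example: fitting a straight line with gradient descent and no libraries. It is an illustrative sketch, not necessarily one of the lessons on the site; `fit_line`, the learning rate, and the epoch count are chosen just for this example.

```python
def fit_line(xs, ys, lr=0.01, epochs=5000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Points generated from y = 2x + 1, so w should approach 2 and b should approach 1.
    w, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
    print(round(w, 3), round(b, 3))
```

Writing the update loop by hand shows what a library's `fit` call is doing underneath, which is what makes its errors easier to debug.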

aayushydvnp

Solved previous years' questions to upload to the website and got a better score than the official solution. Also uploaded the code to Kaggle so it is easy to run.

aayushydvnp

Added resources from various websites and scattered media and arranged them in chronological order so students can easily follow the syllabus and learn.

aayushydvnp

Worked on the website and uploaded links and resources to it. I also worked on a few solutions to the Kaggle problems to help learners navigate contests.

NEXT WORK:

  1. Improve the website
  2. Add resources
  3. Upload a contest and work on the contest mechanism
aayushydvnp

Made a better solution to the problem using the XGBoost algorithm. Got the highest score on the leaderboard.


Comments

kuratus89 about 2 months ago

nice work bro!

aayushydvnp

Prepared a baseline solution for the first problem as both an .ipynb notebook and a Kaggle notebook, which will be made public soon. The solution will also be uploaded to the GitHub page soon.
