
A small MoE-based large language model pretraining

2 devlogs
5h 42m 26s

I am enrolled in an AI research program for high school students that teaches LLM internals, how to train models, and how to publish papers about them. We were taught about the Mixture of Experts (MoE) architecture, so I coded a small MoE model with null experts and trained it on a tiny dataset. The model has 29M parameters.
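Roughly, a null expert is a routing slot that returns zeros, so a token sent there skips the feed-forward computation and only the residual stream carries it forward. Below is a minimal PyTorch-style sketch of what such a layer might look like; the class name, the top-1 routing, and all hyperparameters are my own illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Top-1 routed MoE feed-forward layer with null experts (illustrative sketch).

    The router scores n_experts + n_null slots per token. Tokens routed to one
    of the last n_null slots hit a "null expert" that outputs zeros, i.e. the
    token skips the FFN entirely.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int, n_null: int):
        super().__init__()
        self.n_real = n_experts
        self.router = nn.Linear(d_model, n_experts + n_null)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to (tokens, d_model) for routing
        tokens = x.reshape(-1, x.shape[-1])
        probs = F.softmax(self.router(tokens), dim=-1)   # (tokens, n_real + n_null)
        gate, idx = probs.max(dim=-1)                    # top-1 slot per token

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # scale each expert output by its gate probability
                out[mask] = gate[mask].unsqueeze(-1) * expert(tokens[mask])
        # tokens routed to slots >= n_real chose a null expert: output stays zero
        return out.reshape_as(x)
```

The appeal of null experts in this kind of setup is adaptive compute: the router can learn to spend the FFN only on tokens that need it, while easy tokens pass through on the residual connection alone.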

This project uses AI

I was running into some GPU bottlenecks; I used Claude to track them down, and also to generate the banner.

Demo Repository
