Training an LLM from scratch that has all the necessary information about survival, health and medicine
Seems like I’ve hit a wall, or rather the wall has hit me. The model is producing incomprehensible text, and I’ve come to understand that it’s a scaling problem. Right now, all the data together is ~30 MB. I need to gather way more data and, more importantly, clean data. Seeing the loss drop did feel good; the model is doing the best it can with the data it has, but it doesn’t have much to work with. Regardless, I’m not abandoning this. I hope I can find enough data for the output to become comprehensible!! Onwards and upwards we go!!!!
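For a rough sense of the scale gap, here is a back-of-the-envelope sketch. It assumes about 4 bytes of English text per token and the commonly quoted rule of thumb of about 20 training tokens per model parameter. Both numbers are assumptions, not measurements from this project, and `estimate_token_budget` is a hypothetical helper.

```python
def estimate_token_budget(corpus_bytes, bytes_per_token=4.0, tokens_per_param=20.0):
    """Return (approx_tokens, approx_params) for a corpus of the given size.

    bytes_per_token and tokens_per_param are rough assumptions, not measured values.
    """
    tokens = corpus_bytes / bytes_per_token
    params = tokens / tokens_per_param
    return tokens, params


corpus_bytes = 30 * 1024 * 1024  # ~30 MB, the figure quoted in the entry above
tokens, params = estimate_token_budget(corpus_bytes)
print(f"~{tokens / 1e6:.1f}M tokens, roughly {params / 1e6:.2f}M parameters")
```

Under those assumptions, ~30 MB works out to roughly 8 million tokens, which is a budget for a model well under a million parameters. Counting tokens directly from the tokenized corpus would give a more reliable number than the bytes-per-token guess.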
Ok so, I completed the tokenizer and followed that up by writing and training the model, which took a long time to train (devlog time does not count time spent on training). The vocab size is 8,000 tokens and the maximum sequence length is 512 tokens. All of this was done in plain PyTorch (no Hugging Face). And after all of that effort, the output I got is in the picture (it is just text that does not make sense)…
Really regret adding Shakespearean text to the training data but hey… you live and you learn!
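To make the 512-token maximum sequence length concrete, here is a minimal sketch of one common way to cut a long stream of token ids into fixed-length training examples for next-token prediction. It is not necessarily how this project's data loader works; `make_training_windows` is a hypothetical name, and the toy ids below stand in for real tokenizer output.

```python
def make_training_windows(token_ids, block_size=512):
    """Split a token stream into (inputs, targets) pairs for next-token prediction."""
    windows = []
    # Each window needs block_size + 1 tokens: the inputs are the first
    # block_size tokens, the targets are the same tokens shifted left by one.
    for start in range(0, len(token_ids) - block_size, block_size):
        chunk = token_ids[start:start + block_size + 1]
        windows.append((chunk[:-1], chunk[1:]))
    return windows


# Toy example with a block size of 4 instead of 512.
toy_ids = list(range(10))
for inputs, targets in make_training_windows(toy_ids, block_size=4):
    print(inputs, "->", targets)
```

Windows here do not overlap, and any trailing tokens that cannot fill a full window are dropped. With a small corpus, overlapping windows (a stride smaller than `block_size`) are one way to get more examples out of the same text.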
Ok so I finally got the data together. It took a lot of time but it’s done. Now the actual work begins. I got the tokenizer trained, and it is producing a sane number of tokens while keeping common words intact. The data gathering was split into 3 parts: Survival Data (5.5 MB), Warm-up Data (28.1 MB, English language), and Instruction Data (254 KB, Q&A-formatted). All of it was combined into a tokenizer corpus (one large .txt file) and the tokenizer was trained using SentencePiece. Next up is the architecture and the actual implementation! I’m really excited!!
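The corpus-building step described above could look roughly like this. It is a sketch, not the project's actual script: the file names and the `survival_tok` model prefix are made up, and the vocabulary size of 8,000 is taken from the later entry. `sentencepiece` is a third-party package, imported only inside `train_tokenizer`.

```python
def build_corpus(part_paths, out_path):
    """Concatenate text files into one corpus file, skipping blank lines.

    Returns the number of lines written.
    """
    n_lines = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in part_paths:
            with open(path, encoding="utf-8", errors="ignore") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        out.write(line + "\n")
                        n_lines += 1
    return n_lines


def train_tokenizer(corpus_path, model_prefix="survival_tok", vocab_size=8000):
    """Train a SentencePiece model on the combined corpus."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    # Writes <model_prefix>.model and <model_prefix>.vocab to the working directory.
    spm.SentencePieceTrainer.train(
        input=str(corpus_path),
        model_prefix=model_prefix,
        vocab_size=vocab_size,
    )


# Hypothetical usage (file names are placeholders):
# build_corpus(["survival.txt", "warmup.txt", "instruct.txt"], "corpus.txt")
# train_tokenizer("corpus.txt")
```

A quick sanity check after training is to encode a few survival-related sentences and confirm that common words come out as one or two pieces rather than a long run of fragments.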
To say that finding, cleaning, and processing training data has been difficult would be an understatement. There’s always the fear: is this enough? Is this useful? So far it has been very messy, but step by step I think I’m getting closer to actually usable data. However, the folder structure is still a mess. An LLM really needs a lot of data, and for survival tips I have been using Wikipedia and openly available survival manuals and PDFs. Data gathering is a really important step and I don’t want to take it for granted, so let’s hope all of this works!!
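As an illustration of the kind of cleanup that text extracted from PDFs and wiki pages tends to need, here is a minimal sketch using only the standard library. The rules (Unicode normalization, dropping page-number lines, collapsing whitespace, removing repeated lines) are assumptions about what such data looks like, not a description of this project's pipeline, and both function names are hypothetical.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize Unicode, collapse whitespace, and drop blank or page-number lines."""
    text = unicodedata.normalize("NFKC", raw)
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        # Skip empty lines and lines that are only a page number, e.g. "12" or "Page 12".
        if not line or re.fullmatch(r"(page\s*)?\d+", line, flags=re.IGNORECASE):
            continue
        cleaned.append(line)
    return "\n".join(cleaned)


def drop_duplicate_lines(text):
    """Keep the first occurrence of each line, comparing case-insensitively."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.lower()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return "\n".join(kept)


sample = "Boil   water for one minute.\n12\nPage 3\n\nboil water for one minute. "
print(drop_duplicate_lines(clean_text(sample)))
```

Line-level deduplication is blunt: it also removes legitimately repeated short lines, so it is probably best limited to headers and footers that repeat on every page of a manual.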