EndAI – Devlog #1
Hey! This is the first devlog for EndAI.
EndAI is a local AI chat system that runs GGUF models with llama.cpp through Python. The goal is to load models, chat with them, and manage sessions from one simple server.
What’s done so far:
The backend is written in Python using Flask. It can load GGUF models and run them locally. It detects your hardware (CUDA, ROCm, Metal, or CPU) and automatically picks a backend configuration for better performance.
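To give a feel for the hardware detection, here is a minimal sketch of how such a check might look. This is not EndAI's actual code: the function names and the heuristic of probing for vendor tools on the PATH are my own illustration.

```python
import platform
import shutil

def detect_backend():
    """Guess the best llama.cpp backend for this machine.

    Heuristic sketch: checks for vendor tools on PATH rather than
    probing the GPU directly. Order matters: Apple Silicon first,
    then NVIDIA, then AMD, with CPU as the fallback.
    """
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"
    if shutil.which("nvidia-smi"):
        return "cuda"
    if shutil.which("rocminfo"):
        return "rocm"
    return "cpu"

def gpu_layers_for(backend):
    # llama.cpp convention: -1 offloads all layers to the GPU,
    # 0 keeps everything on the CPU.
    return -1 if backend != "cpu" else 0
```

The detected backend then feeds the model-loading parameters, so the same server binary works on a gaming PC and a MacBook without configuration.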
You can load and unload models without restarting the server. There is also a basic downloader to fetch models in the background and track progress while they download.
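The downloader boils down to a background thread that streams chunks and exposes a progress counter the UI can poll. Here is a hedged sketch of that idea; the class and method names are illustrative, and `source` is any file-like object (in the real server it would be an HTTP response stream).

```python
import threading

class Download:
    """Streams a model file in a background thread and tracks progress."""

    def __init__(self, source, dest, total_bytes, chunk_size=8192):
        self.source = source          # file-like object to read from
        self.dest = dest              # file-like object to write to
        self.total = total_bytes
        self.chunk_size = chunk_size
        self.done = 0
        self.finished = False
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _run(self):
        # Copy chunk by chunk so progress updates smoothly.
        while True:
            chunk = self.source.read(self.chunk_size)
            if not chunk:
                break
            self.dest.write(chunk)
            self.done += len(chunk)
        self.finished = True

    def progress(self):
        # Fraction complete, polled by the frontend while downloading.
        return self.done / self.total if self.total else 0.0

    def wait(self):
        self._thread.join()
```

Because the thread is a daemon and only updates simple counters, the Flask server stays responsive while a multi-gigabyte GGUF file comes down.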
I added multiple prompt templates so different model formats work properly (like ChatML, Llama 2, Llama 3, Mistral, Alpaca). There is also a simple token counter and a system to trim long chats so the model does not run out of context.
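The template and trimming logic can be sketched roughly like this. Everything here is illustrative, not EndAI's real code: the real token counter would use llama.cpp's tokenizer instead of the crude characters-per-token estimate below, and real templates come from each model's documentation.

```python
# Assumed template shapes (ChatML and Alpaca shown); each maps a
# role and message text to the model's expected wire format.
TEMPLATES = {
    "chatml": lambda role, text: f"<|im_start|>{role}\n{text}<|im_end|>\n",
    "alpaca": lambda role, text: (
        f"### {'Instruction' if role == 'user' else 'Response'}:\n{text}\n\n"
    ),
}

def estimate_tokens(text):
    # Rough stand-in: llama.cpp exposes the exact tokenizer;
    # ~4 characters per token is a common ballpark for English.
    return max(1, len(text) // 4)

def trim_history(messages, budget):
    """Drop the oldest non-system messages until the estimate fits.

    Keeps the system prompt (if present) and the most recent turns,
    so the model never runs out of context mid-conversation.
    """
    kept = list(messages)
    while len(kept) > 1 and sum(
        estimate_tokens(m["content"]) for m in kept
    ) > budget:
        drop_at = 1 if kept[0]["role"] == "system" else 0
        kept.pop(drop_at)
    return kept
```

Trimming from the front while pinning the system prompt is the simplest policy; fancier schemes (summarizing dropped turns, for example) can be layered on later.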
Chat sessions are saved in a JSON file so you don’t lose your conversations.
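Session persistence is about the smallest feature here, but one detail is worth showing: writing to a temporary file and then replacing the original, so a crash mid-write can't corrupt saved chats. A minimal sketch (function names are my own, not EndAI's API):

```python
import json
from pathlib import Path

def save_sessions(sessions, path):
    """Persist all chat sessions: write to a temp file, then replace.

    The replace step is atomic on most filesystems, so the old file
    survives intact if the process dies mid-write.
    """
    tmp = Path(path).with_suffix(".tmp")
    tmp.write_text(json.dumps(sessions, indent=2))
    tmp.replace(path)

def load_sessions(path):
    # Missing file just means no saved conversations yet.
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}
```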
What’s next:
More stability and cleanup. The system works, but it still needs polishing and better structure before I expand it further.
Why it took longer:
This project took longer than expected because llama.cpp didn't behave the way I expected at first. I also mixed different languages and approaches while building it, so once I realized the codebase was getting messy and inconsistent, I had to rewrite parts of it from scratch. Basically a lot of trial, error, and fixing my own confusion.
That’s it for now. I will continue coding! Cya!