Title: FloatLLM Update: FIRST OUTPUT! 🚀 Offline Tokenizer, VRAM Zero-Copy Loop, and Backend Expansion
Something magical happened: I saw the first output from the engine running the AI!
Status Update:
The engine is officially alive. After mapping the memory safely in the last update, I had to build the translation layer and wire it into the bare-metal GPU graph. I hit a brutal Apple Metal segmentation fault when trying to run a continuous loop, but after reverse-engineering how the GPU allocator handles raw Python RAM, I finally got past it.
What I Built:
100% Offline Tokenizer (Phase 4): Built a universal GGUF tokenizer that dynamically extracts the vocabulary array directly from the model’s metadata headers. It reads only the metadata and never touches the weight tensors, so its memory use depends on the vocabulary size rather than the parameter count. That means tokenizing for a massive 405B model costs about the same as for a tiny 1B model with a similar vocabulary, entirely offline.
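For anyone curious how the vocabulary extraction can work, here is a minimal sketch of reading the token list out of a GGUF header. It is not the actual FloatLLM code: the function names are hypothetical, and it assumes a little-endian GGUF file of version 2 or later, where the vocabulary is stored under the `tokenizer.ggml.tokens` metadata key as an array of strings.

```python
import io
import struct

# GGUF metadata value type codes for fixed-size scalars (struct formats).
_SCALARS = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian scalar described by a struct format."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read one metadata value of the given type code."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_vocab(f):
    """Return the token strings from an open binary GGUF stream.

    Only the metadata section is walked; tensor data is never read.
    """
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("this sketch assumes GGUF version 2 or later")
    _read(f, "<Q")             # tensor count, not needed here
    kv_count = _read(f, "<Q")  # number of metadata key/value pairs
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "tokenizer.ggml.tokens":
            return value
    raise KeyError("tokenizer.ggml.tokens not found in metadata")
```

Usage would look like `with open(path, "rb") as f: vocab = read_gguf_vocab(f)`, where `path` points at any GGUF model file. The reader stops as soon as it finds the vocabulary key, which is why the cost tracks the header size and not the model size.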
The ctypes Memory Bridge: Successfully mapped the Python integer arrays (Token IDs) to strict C-compatible 32-bit integer pointers, allowing the Python router to seamlessly pass user prompts into the C++ math graph without blocking the main thread.
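As a rough illustration of that bridge, the sketch below turns a Python list of token IDs into a contiguous C `int32` array and a pointer to it. The helper name `to_int32_array` is hypothetical, and the commented-out library call is a made-up placeholder for whatever entry point the C++ side exposes.

```python
import ctypes

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def to_int32_array(token_ids):
    """Copy Python ints into a contiguous C array of 32-bit integers.

    ctypes does not check for overflow on its own, so out-of-range
    values are rejected here instead of being silently wrapped.
    """
    for t in token_ids:
        if not INT32_MIN <= t <= INT32_MAX:
            raise OverflowError(f"token id {t} does not fit in int32")
    return (ctypes.c_int32 * len(token_ids))(*token_ids)


tokens = [128000, 7852, 754, 564, 13734, 636, 19650, 66]
buf = to_int32_array(tokens)

# A typed pointer to the first element, suitable for an `const int32_t *`
# parameter on the C/C++ side.
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_int32))

# Hypothetical call into the engine (name and signature are assumptions):
# lib.floatllm_submit_prompt(ptr, len(buf))
```

One thing to watch: `buf` has to stay referenced on the Python side for as long as the C++ code holds `ptr`, otherwise the garbage collector can free the memory underneath it.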
VRAM Detachment & Loop Fixes (The Segfault Slayer): Encountered a massive crash because Apple Metal strictly forbids reading raw Python RAM. Fixed it by explicitly detaching the CPU memory pointers, allowing the GPU to allocate safe VRAM, uploading the zero-copy mapped data into it, and then restoring the pointers for the next generation loop. Also had to nullify the global buffer states so the allocator wouldn’t panic during continuous generation. Debugging this took a lot of help from AI.
Backend Expansion: Expanded the C++ router to dynamically map and compile for OpenCL, SYCL (Intel oneAPI), Kompute (Vulkan), and native ARM/CPU environments.
The Result:
I bypassed the hidden attention layers just to test the raw pipeline. The C++ engine successfully grabbed the prompt, executed the matrix multiplication on the MTL0 hardware, ran the Greedy Argmax sampling, and streamed the tokens back to the Python terminal in real-time at lightning speed!
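The greedy argmax step is simple enough to sketch in a few lines. This is a plain-Python illustration, not the C++ implementation: `step` is a hypothetical stand-in for the engine call that returns one logit per vocabulary entry, and `generate` and `eos_id` are names made up for the example.

```python
from typing import Callable, Iterator, List, Optional, Sequence


def greedy_argmax(logits: Sequence[float]) -> int:
    """Return the index of the largest logit (first one wins on ties)."""
    if not logits:
        raise ValueError("logits must not be empty")
    return max(range(len(logits)), key=logits.__getitem__)


def generate(
    step: Callable[[List[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    eos_id: Optional[int] = None,
) -> Iterator[int]:
    """Yield token ids one at a time, always picking the top logit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = greedy_argmax(step(ids))
        if next_id == eos_id:
            break
        ids.append(next_id)
        yield next_id  # stream each token back as soon as it is chosen


def toy_step(ids: List[int]) -> List[float]:
    """Toy model over a 5-token vocabulary: always favors last id + 1."""
    logits = [0.0] * 5
    logits[(ids[-1] + 1) % 5] = 1.0
    return logits


print(list(generate(toy_step, [0], 3)))  # prints [1, 2, 3]
```

Greedy decoding is deterministic, so the same prompt gives the same tokens on every run, which makes it a handy first sampler for checking that the pipeline itself is stable.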
Because the “brain” (Attention) isn’t wired yet, it hallucinated pure gibberish—but it is magnificent, mathematically perfect gibberish without a single memory leak:
Output:
[FloatLLM(C++)] --- RECEIVING USER PROMPT ---
[FloatLLM(C++)] Python sent 8 mapped mathematical tokens across the bridge.
[FloatLLM(C++)] Raw Integer Payload: [ 128000 7852 754 564 13734 636 19650 66 ]
[FloatLLM] --------------------------------------------------------------------------------
[FloatLLM] Engine successfully mapped. Handing to AI...
User: What is the capital of France?
FloatLLM: brunette Viewer -country factory .repeat ious entre -tabs iring