FloatLLM

6 devlogs
37h 22m 33s

A universal, memory-aware LLM inference engine designed to execute massive models under extreme hardware constraints. By utilizing dynamic memory slicing and a custom C++ backend, FloatLLM can theoretically run a 405B parameter model on a device with just 4GB of RAM. Similar in concept to AirLLM, but entirely hardware-agnostic: it natively supports both CPU and GPU execution without being locked into specific ecosystems like CUDA.

This project uses AI

Used Gemini as a debugging partner for the hardest technical hurdles, such as OS-specific build errors and a macOS ggml bug. Gemini also helped troubleshoot the bare-metal C++ inference engine, refine tensor management in the Python loader, and debug the tokenizer. Google Search AI Overviews were used to research core concepts and discover cross-platform C++ libraries. Gemini also polished the grammar mistakes in README.md :)

Daemon

Shipped this project!

I built a fully local AI engine that supports low-end devices and also acts as a safeguard when running AI models. Because it controls its own RAM usage dynamically, it can, in theory, run large models of around 405B parameters on a device with 4GB of RAM.

Daemon

Title: FloatLLM Update: FIRST OUTPUT! 🚀 Offline Tokenizer, VRAM Zero-Copy Loop, and Backend Expansion

Magic happened: I saw the first output from the engine running the AI!

Status Update:

The engine is officially alive. After mapping the memory safely in the last update, I had to build the translation layer and wire it into the bare-metal GPU graph. I hit a brutal Apple Metal segmentation fault when trying to run a continuous loop, but after reverse-engineering how the GPU allocator handles raw Python RAM, I finally broke through the system.

What I Built:

100% Offline Tokenizer (Phase 4): Built a universal gguf tokenizer that dynamically extracts the vocabulary array directly from the model’s metadata headers. It takes virtually zero RAM, meaning it can flawlessly tokenize a massive 405B model exactly as fast as a tiny 1B model, entirely offline.
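The vocabulary extraction described above can be sketched roughly as follows. This is a minimal reader, not FloatLLM's actual tokenizer: it assumes a little-endian GGUF file of version 2 or 3 and that the vocabulary is stored under the conventional `tokenizer.ggml.tokens` key. The function names are illustrative. Only the metadata header is scanned, so no tensor data is ever read.

```python
import io
import struct

# GGUF metadata value type codes (GGUF v2/v3, little-endian).
_SCALARS = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}
_STRING, _ARRAY = 8, 9


def _read(stream, fmt):
    """Read and unpack one fixed-size value."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(stream):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(stream, "<Q")
    return stream.read(length).decode("utf-8", errors="replace")


def _read_value(stream, value_type):
    if value_type in _SCALARS:
        return _read(stream, _SCALARS[value_type])
    if value_type == _STRING:
        return _read_string(stream)
    if value_type == _ARRAY:
        item_type = _read(stream, "<I")
        count = _read(stream, "<Q")
        return [_read_value(stream, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {value_type}")


def read_vocab(stream, key="tokenizer.ggml.tokens"):
    """Scan the metadata key/value section and return the list stored under `key`."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version = _read(stream, "<I")
    _tensor_count = _read(stream, "<Q")
    kv_count = _read(stream, "<Q")
    for _ in range(kv_count):
        name = _read_string(stream)
        value_type = _read(stream, "<I")
        value = _read_value(stream, value_type)
        if name == key:
            return value
    raise KeyError(key)
```

With a real model this would be called on an open binary file handle, for example `read_vocab(open("model.gguf", "rb"))`, where the file name is a placeholder.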

The ctypes Memory Bridge: Successfully mapped the Python integer arrays (Token IDs) to strict C-compatible 32-bit integer pointers, allowing the Python router to seamlessly pass user prompts into the C++ math graph without blocking the main thread.
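A minimal sketch of that conversion using only the standard `ctypes` module. The helper name `to_int32_buffer` and the C function mentioned in the comment are hypothetical. The token IDs are the ones from the log output further down.

```python
import ctypes


def to_int32_buffer(token_ids):
    """Copy Python ints into a contiguous C int32 array; return (array, pointer)."""
    for token in token_ids:
        # ctypes silently truncates out-of-range ints, so check explicitly.
        if not -2**31 <= token < 2**31:
            raise OverflowError(f"token id {token} does not fit in int32")
    array = (ctypes.c_int32 * len(token_ids))(*token_ids)
    pointer = ctypes.cast(array, ctypes.POINTER(ctypes.c_int32))
    return array, pointer


tokens = [128000, 7852, 754, 564, 13734, 636, 19650, 66]
array, ptr = to_int32_buffer(tokens)
# A call into the compiled backend would then look something like
# lib.floatllm_prompt(ptr, len(tokens)), where floatllm_prompt is a made-up name.
```

Keep a reference to `array` for as long as the C side may read from the pointer, so the buffer is not garbage-collected mid-call.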

VRAM Detachment & Loop Fixes (The Segfault Slayer): Encountered a massive crash because Apple Metal strictly forbids reading raw Python RAM. Fixed it by explicitly detaching the CPU memory pointers, allowing the GPU to allocate safe VRAM, uploading the zero-copy data, and then restoring the pointers for the next generation loop. Also had to nullify the global buffer states so the allocator wouldn’t panic during continuous generation. This required a lot of help from AI to debug the issue.

Backend Expansion: Expanded the C++ router to dynamically map and compile for OpenCL, SYCL (Intel oneAPI), Kompute (Vulkan-based), and native ARM/CPU environments.

The Result:

I bypassed the hidden attention layers just to test the raw pipeline. The C++ engine successfully grabbed the prompt, executed the matrix multiplication on the MTL0 hardware, ran the Greedy Argmax sampling, and streamed the tokens back to the Python terminal in real-time at lightning speed!
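For reference, greedy argmax sampling just picks the highest-scoring token at every step. A plain-Python sketch of the idea (the real sampling runs in the C++ engine; the function name here is illustrative):

```python
def greedy_argmax(logits):
    """Return the index of the largest logit; the first index wins ties."""
    if not logits:
        raise ValueError("logits must not be empty")
    best_index = 0
    for index, value in enumerate(logits):
        if value > logits[best_index]:
            best_index = index
    return best_index
```

Because the choice is deterministic, the same prompt always yields the same tokens, which is handy when testing a raw pipeline like this one.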

Because the “brain” (Attention) isn’t wired yet, it hallucinated pure gibberish—but it is magnificent, mathematically perfect gibberish without a single memory leak:

Output:

[FloatLLM(C++)] --- RECEIVING USER PROMPT ---
[FloatLLM(C++)] Python sent 8 mapped mathematical tokens across the bridge.
[FloatLLM(C++)] Raw Integer Payload: [ 128000 7852 754 564 13734 636 19650 66 ]
[FloatLLM] --------------------------------------------------------------------------------
[FloatLLM] Engine successfully mapped. Handing to AI...

User: What is the capital of France?
FloatLLM: brunette Viewer -country factory .repeat ious entre -tabs iring
Daemon

Title: FloatLLM Update: Segfault Issue (Fixed), 4D Tensor and mmap Implementation, and Dynamic Memory Allocation

I was close to finishing, but because I kept updating the devlog text, I had to push the code so I can resume my work afterwards.

Status Update:
I dealt with segmentation faults, a headache that took about 4 hours to fix. At the peak of the chaos, everything was falling back to the CPU instead of the GPU. Reverting with git and re-implementing with a 4D tensor shape and dynamic memory allocation finally worked!

What I Built:
I used CMake to build the C++ ggml library for whatever hardware it is running on. Added builds for Apple Metal, CUDA, and Vulkan as the currently supported backends.
I started with 1D tensors, but because of matrix multiplication and how fast newer models are evolving, I switched to 4D tensors (ne0, ne1, ne2, ne3), so the same idea could later be applied to images too. Since I am using ggml, this also covers 2D and 3D tensors.
I set no_alloc = true so ggml creates the tensor metadata without allocating memory for the tensor data, which keeps RAM usage low.
Thanks to dynamic_mem_size, it can load even Llama 3 405B Instruct.
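The 4D convention can be illustrated with a short sketch: lower-rank shapes are padded with trailing 1s, which is why one code path covers 1D through 4D tensors. The helper names are illustrative, not part of ggml or FloatLLM.

```python
def to_ggml_shape(dims):
    """Pad a 1D-4D shape with trailing 1s to get (ne0, ne1, ne2, ne3)."""
    dims = tuple(dims)
    if not 1 <= len(dims) <= 4:
        raise ValueError("expected between 1 and 4 dimensions")
    if any(d < 1 for d in dims):
        raise ValueError("dimensions must be positive")
    return dims + (1,) * (4 - len(dims))


def element_count(shape4):
    """Total number of elements in a padded 4D shape."""
    total = 1
    for extent in shape4:
        total *= extent
    return total
```

For example, a 4096-element norm vector becomes `(4096, 1, 1, 1)`, matching the shape printed for `output_norm.weight` in the log below.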

The Result:
I broke through. The engine dynamically calculates the failsafe RAM boundaries, slices the massive model into the required number of safe chunks, and maps the 4D tensors straight onto the GPU/CPU without a single memory leak or crash. For example:

[FloatLLM C++] Mapped output_norm.weight| Shape: [4096, 1, 1, 1]| Target hardware: MTL0
[FloatLLM] Chunk 7 Executed. Hardware link closed.
[FloatLLM] --------------------------------------------------------------------------------
[FloatLLM] Engine successfully mapped.
[FloatLLM(C++)] Assembling computation graph...
[FloatLLM(C++)] Graph memory reserved. Pushing data to compute cores...
[Float(C++)] Matrix math complete. Target hardware stabilized.

Note: I hit the time limit of about 10 hours, so I had to push the code as it is. The CMake build is not yet dynamic enough to work on every kind of hardware; I will fix this in the next devlog. For now, because of the CMake build, the minimum required RAM is more than 4GB. However, I found a way to make it work with even less on my Android device using these commands:

rm -rf build

cmake -B build \
  -DCMAKE_BUILD_TYPE=MinSizeRel \
  -DGGML_VULKAN=OFF \
  -DGGML_DIR=../ggml \
  -DCMAKE_C_FLAGS="-Os -g0" \
  -DCMAKE_CXX_FLAGS="-Os -g0" \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,--no-keep-memory"

cmake --build build --config MinSizeRel -j 1
Daemon

Title: FloatLLM Major Update: Zero-Copy C++ Bridge & Dynamic Chunking Locked In

Status Update:
I’ve taken a massive pivot from the original architecture. I completely scrapped the AirLLM-style “layer-by-layer” static swapping approach because the disk I/O bottleneck was too severe and slow.

Instead, FloatLLM is now operating on Dynamic Zero-Copy Memory Chunking.

What I Built:

  1. The Python Memory Mapper: The engine now parses .gguf metadata directly, interrogates the host hardware limits, and mathematically groups hundreds of neural network tensors into massive, safe “Execution Blocks” (e.g., dynamically slicing a 4.4GB Llama 3 model into 5 strict RAM chunks specified by the user).
  2. The C++ Cross-Language Bridge: I bypassed Python’s high-level memory locks to extract the raw, read-only OS memory addresses of those mmap chunks. Using ctypes, Python successfully fires those pointers across the language barrier into a custom, compiled C++ backend.
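The grouping step in item 1 can be sketched as a greedy packer. This is an assumption about the general approach, not FloatLLM's actual algorithm: tensors are taken in file order and a new chunk starts whenever the next tensor would exceed the byte budget. The function name and the tuple layout are illustrative.

```python
def group_into_chunks(tensors, chunk_budget_bytes):
    """Greedily pack (name, size_in_bytes) pairs, in order, into chunks under a budget."""
    chunks, current, used = [], [], 0
    for name, size in tensors:
        if size > chunk_budget_bytes:
            raise MemoryError(f"{name} ({size} B) is larger than the chunk budget")
        if current and used + size > chunk_budget_bytes:
            # The next tensor would overflow this chunk, so close it and start a new one.
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        chunks.append(current)
    return chunks
```

Keeping file order means each chunk corresponds to one contiguous region of the model file, which is convenient for memory mapping.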

The Result:
I successfully stress-tested the bridge. The engine mapped 291 Llama 3 tensors, grouped them, and streamed the raw C-pointers to the C++ engine without a single memory leak, Out-Of-Memory panic, or segfault.
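One way to obtain a raw address for a memory-mapped file from Python is sketched below. It is not necessarily how FloatLLM does it: `ctypes` can only take the address of a writable buffer, so this sketch maps the file copy-on-write (`mmap.ACCESS_COPY`) instead of strictly read-only. The file on disk is never modified. The helper name `map_file` is illustrative.

```python
import ctypes
import mmap
import os
import tempfile


def map_file(path):
    """Map `path` copy-on-write and return (mapping, base_address)."""
    with open(path, "rb") as handle:
        # ACCESS_COPY gives a private writable view; writes never reach the file.
        mapping = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_COPY)
    anchor = ctypes.c_char.from_buffer(mapping)  # requires a writable buffer
    address = ctypes.addressof(anchor)
    del anchor  # release the buffer export so mapping.close() can succeed later
    return mapping, address


# Demo with a throwaway file standing in for a .gguf model.
fd, path = tempfile.mkstemp()
os.write(fd, b"FloatLLM-weights")
os.close(fd)

mapping, base = map_file(path)
head = ctypes.string_at(base, 8)      # what C code would read at the pointer
tail = ctypes.string_at(base + 9, 7)  # a chunk pointer is simply base + offset
mapping.close()
os.remove(path)
```

The address is only valid while the mapping is open, so the mapping object has to outlive every C call that uses it.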

The failsafe foundation is bulletproof. Next: Importing the ggml tensor library to turn those raw C-pointers into actual matrix multiplication!

Daemon

Fixed macOS storage quirks regarding physical vs. purgeable space; the hardware router is now officially bulletproof. The engine auto-detects everything from Apple Silicon to Linux Vulkan GPUs. I’ve implemented strict OOM/storage failsafes for 40GB+ models and added custom CLI flags for dynamic/AOT quantization. The custom system architecture and pre-flight dashboard are now fully validated.
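A storage pre-flight check of this kind can be approximated with the standard library. This is a simplified sketch with an illustrative function name and a 1 GiB default reserve chosen only for the example. It does not account for the macOS purgeable-space quirk mentioned above, since `shutil.disk_usage` reports only what the OS currently counts as free.

```python
import shutil


def has_room_for(model_bytes, directory=".", reserve_bytes=1 << 30):
    """Return True if `directory` has space for the model plus a safety reserve."""
    free = shutil.disk_usage(directory).free
    return free >= model_bytes + reserve_bytes
```

A loader could call this before touching a 40GB+ model and refuse to start when it returns False.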

Daemon

I wrote the core routing logic for FloatLLM (floatllm_router.py) and built a dual-purpose memory dashboard and runtime interceptor. The dashboard keeps the user aware of their memory usage. Meanwhile, the runtime interceptor watches the crash threshold while the model is loading. If RAM usage crosses that limit, the system shuts itself down safely instead of hard-crashing. This prevents the data corruption that could happen if it were forcefully killed by the Low Memory Killer Daemon (LMKD)! I have also verified that it runs on both of the devices I have available.
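The interceptor idea can be sketched as a guard that is consulted before every loading step. All names here are hypothetical and this is not the code in floatllm_router.py. The memory reading is injected as a callable, so the sketch stays platform-neutral; in practice it could come from `/proc/meminfo` or a library such as psutil.

```python
class MemoryLimitExceeded(RuntimeError):
    """Raised when used memory reaches the configured crash threshold."""


class MemoryGuard:
    def __init__(self, limit_bytes, read_used_bytes):
        self.limit_bytes = limit_bytes
        self.read_used_bytes = read_used_bytes  # callable returning bytes in use

    def check(self):
        used = self.read_used_bytes()
        if used >= self.limit_bytes:
            raise MemoryLimitExceeded(f"{used} B used, limit is {self.limit_bytes} B")
        return used


def load_with_guard(steps, guard):
    """Run loading steps in order; stop cleanly before a step if the limit is hit."""
    completed = []
    for step in steps:
        try:
            guard.check()
        except MemoryLimitExceeded:
            return completed, False  # controlled stop instead of an OS kill
        completed.append(step())
    return completed, True
```

Stopping between steps gives the process a chance to close files and release mappings itself, rather than being killed at an arbitrary point.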

Daemon

The current landscape of local AI is too dependent on massive VRAM and dedicated CUDA cores. I’m building FloatLLM, a hardware-agnostic engine, to fix that! Popular engines such as llama.cpp already exist, but the problem with them is that they use a lot of RAM. You can’t even run a 4B model on a device with 4GB of RAM unless you use quantization. Quantization costs the model accuracy and knowledge, which can make it hallucinate. The goal of FloatLLM is to achieve a 100% offline, hardware-agnostic inference engine that dynamically scales to fit whatever device it’s running on.

Right now, I’m laying the architectural groundwork—specifically mapping out the Universal RAM Allocator and Hardware Router. The dev environment is fully set up, and the initial Python routing logic is underway. Next up: testing tight memory constraints and seeing how it handles the load!
