Segfault issue (fixed), 4D tensor and mmap implementation, and dynamic memory allocation
I was close to finishing, but because I kept having to update the Devlog text, I am pushing the code now so I can resume my work later.
Update:
- I dealt with segmentation faults; the issue was a headache that took about 4 hours to fix. At the peak of the chaos, everything was falling back to the CPU instead of the GPU. Reverting with git and re-implementing with the 4D tensor shape and dynamic memory allocation fixed it!
What I Built:
- I used CMake to link the C++ ggml library against the hardware it is running on. Builds for Apple (Metal), CUDA, and Vulkan are currently supported.
- I started with 1D tensors, but because of matrix multiplication and how quickly newer models are evolving, I switched to 4D tensors (ne0, ne1, ne2, ne3), so in the future I can apply the same idea to images too. Since I am using ggml, this also covers 2D and 3D tensors: the unused dimensions are simply set to 1.
- I set no_alloc = true, so the ggml context stores only tensor metadata instead of allocating the tensor data itself, and the system works without taking up much RAM.
- Thanks to dynamic_mem_size, it can load even the Llama 3 405B Instruct model.
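To make the 4D shape idea concrete, here is a minimal sketch of how a 1D, 2D, or 3D shape can be padded to (ne0, ne1, ne2, ne3). It uses only the standard library; to_shape_4d and element_count are hypothetical helper names for illustration, not ggml functions and not necessarily what the engine does internally.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of ggml): pad a 1D-4D shape to four
// dimensions by filling the unused trailing dimensions with 1.
std::array<std::int64_t, 4> to_shape_4d(const std::vector<std::int64_t>& dims) {
    if (dims.empty() || dims.size() > 4) {
        throw std::invalid_argument("expected between 1 and 4 dimensions");
    }
    std::array<std::int64_t, 4> ne = {1, 1, 1, 1};
    for (std::size_t i = 0; i < dims.size(); ++i) {
        if (dims[i] <= 0) {
            throw std::invalid_argument("dimensions must be positive");
        }
        ne[i] = dims[i];  // ne[0] is the first dimension given by the caller
    }
    return ne;
}

// Total number of elements described by a 4D shape.
std::int64_t element_count(const std::array<std::int64_t, 4>& ne) {
    return ne[0] * ne[1] * ne[2] * ne[3];
}
```

With this sketch, a 1D weight of length 4096 becomes [4096, 1, 1, 1], the same shape form the engine prints for output_norm.weight, and its element count is unchanged by the padding.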
The Result:
- The engine dynamically calculates fail-safe RAM boundaries, slices the massive model into the required number of safe chunks, and maps the 4D tensors straight onto the GPU/CPU without a single memory leak or crash. Example output:
[FloatLLM C++] Mapped output_norm.weight| Shape: [4096, 1, 1, 1]| Target hardware: MTL0
[FloatLLM] Chunk 7 Executed. Hardware link closed.
[FloatLLM] --------------------------------------------------------------------------------
[FloatLLM] Engine successfully mapped.
[FloatLLM(C++)] Assembling computation graph...
[FloatLLM(C++)] Graph memory reserved. Pushing data to compute cores...
[Float(C++)] Matrix math complete. Target hardware stabilized.
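The chunking step can be pictured with a small sketch. This is not the engine's actual code: it assumes a simple greedy strategy in which consecutive tensors are packed into a chunk until the next one would exceed a fixed byte budget, and the names Chunk, plan_chunks, and budget_bytes are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical chunk descriptor: tensors [first, last) and their total size.
struct Chunk {
    std::size_t first;
    std::size_t last;
    std::uint64_t bytes;
};

// Greedily pack consecutive tensors into chunks that each fit the budget.
// Assumption: every tensor must be loaded whole, so a single tensor larger
// than the budget is reported as an error instead of being split.
std::vector<Chunk> plan_chunks(const std::vector<std::uint64_t>& tensor_bytes,
                               std::uint64_t budget_bytes) {
    std::vector<Chunk> chunks;
    Chunk current{0, 0, 0};
    for (std::size_t i = 0; i < tensor_bytes.size(); ++i) {
        const std::uint64_t size = tensor_bytes[i];
        if (size > budget_bytes) {
            throw std::runtime_error("tensor larger than the chunk budget");
        }
        if (current.bytes + size > budget_bytes) {
            chunks.push_back(current);   // close the chunk that is full
            current = Chunk{i, i, 0};    // start a new one at this tensor
        }
        current.bytes += size;
        current.last = i + 1;
    }
    if (current.last > current.first) {
        chunks.push_back(current);       // flush the final, partly filled chunk
    }
    return chunks;
}
```

For example, tensor sizes of 40, 40, 40, 100, and 10 bytes with a budget of 100 bytes give four chunks of 80, 40, 100, and 10 bytes. In a real loader the budget would come from the measured free RAM rather than a constant.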
Note: I hit the time limit of about 10 hours, so I had to push the code as it is. The CMake build does not yet adapt itself to every hardware target; I will fix this in the next Devlog. Because of the CMake build, the minimum required RAM is currently more than 4 GB. However, I found a way to make it work with even less by using these commands on my Android device:
rm -rf build
cmake -B build \
-DCMAKE_BUILD_TYPE=MinSizeRel \
-DGGML_VULKAN=OFF \
-DGGML_DIR=../ggml \
-DCMAKE_C_FLAGS="-Os -g0" \
-DCMAKE_CXX_FLAGS="-Os -g0" \
-DCMAKE_EXE_LINKER_FLAGS="-Wl,--no-keep-memory"
cmake --build build --config MinSizeRel -j 1