TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS
65 points - today at 6:06 PM
At least this one gave credit to the upstream projects which it used as a reference.
The llama.cpp project is also getting a wave of vibecoded PRs that are very clearly being produced by pointing Claude at the repo and the original paper and having it produce something.
Almost none of these attempts contain information that really matters, like actual benchmark tests with different KV quantization levels (not just perplexity or KLD).
Llama.cpp already has KV compression and one of the turbo quant PRs will get merged at some point.
If you don’t care about the fancy 3-bit, the q8 KV compression is good enough! Don’t bother with q4.
./build/bin/llama-server -m model.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -c 65536
Etc
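For a sense of what those flags buy you: KV cache size scales linearly with bytes per element, so moving from f16 to q8_0 roughly halves it. A back-of-the-envelope calculator (the layer/head numbers below are illustrative of a 70B-class GQA model, not any specific checkpoint):

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elt.
# q8_0 stores 1 byte per value plus one f16 scale per 32-value block (~1.0625 B/elt).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elt):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elt

# Illustrative shape: 80 layers, 8 KV heads (GQA), head_dim 128, 64K context.
f16 = kv_cache_bytes(80, 8, 128, 65536, 2.0)
q8  = kv_cache_bytes(80, 8, 128, 65536, 1.0625)  # 1 B value + f16 scale / 32 values
print(f"f16:  {f16 / 2**30:.1f} GiB")   # 20.0 GiB
print(f"q8_0: {q8  / 2**30:.1f} GiB")   # 10.6 GiB
```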
./SwiftLM \
--model mlx-community/Qwen3.5-122B-A10B-4bit \
--stream-experts \
--port 5413
Error: [SwiftLM] Loading model: mlx-community/Qwen3.5-122B-A10B-4bit
[SwiftLM] Enabled Async SSD Streaming on directory: e9c67b08899964be5fdd069bb1b4bc8907fe68f5
[SwiftLM] Memory strategy: FULL GPU (69.6GB model, 133.4GB available)
[SwiftLM] Download: [===================>] 100% ⠋ (66395.4 MB / 66395.4 MB) | Speed: 0.0 MB/s
MLX error: Failed to load the default metallib. library not found library not found library not found library not found at /Users/runner/work/SwiftLM/SwiftLM/LocalPackages/mlx-swift/Source/Cmlx/mlx-c/mlx/c/stream.cpp:115

Of course large corps will have fancy proprietary models, but for everyday queries and tasks, local feels like a huge win that's just slightly out of reach.
Am I missing something fundamental?
TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime, completely eliminating Python overhead.
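The Lloyd-Max part is classical: alternate between assigning each value to its nearest codeword and moving each codeword to the mean of its assigned values (i.e., 1-D k-means). A toy scalar-quantizer sketch to show the iteration, not the paper's actual V3 codebooks:

```python
import numpy as np

def lloyd_max(x, n_levels=8, iters=50):
    """Fit a 1-D Lloyd-Max (k-means) codebook to samples x."""
    # Initialize codewords at evenly spaced quantiles of the data.
    codebook = np.quantile(x, np.linspace(0, 1, n_levels + 2)[1:-1])
    for _ in range(iters):
        # Assignment step: each sample goes to its nearest codeword.
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        # Update step: each codeword moves to the centroid of its cluster.
        for k in range(n_levels):
            if np.any(idx == k):
                codebook[k] = x[idx == k].mean()
    return codebook, idx

rng = np.random.default_rng(0)
x = rng.normal(size=10_000).astype(np.float32)
codebook, idx = lloyd_max(x, n_levels=8)       # 8 levels = 3-bit codes
mse = np.mean((x - codebook[idx]) ** 2)        # reconstruction error
```

For Gaussian data, 3-bit Lloyd-Max gets the quantization MSE down to a few percent of the variance, which is why low-bit KV codebooks are viable at all.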
SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or Watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.
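The streaming idea is essentially: map the weight file once, and on each forward pass read only the byte ranges of the top-k experts the router selected; the OS page cache keeps hot experts resident. A minimal Python sketch of that access pattern (the flat per-expert layout and sizes are hypothetical, not SwiftLM's actual file format):

```python
import mmap, os, tempfile
import numpy as np

EXPERT_BYTES = 4096  # hypothetical per-expert weight size

def load_experts(mm, expert_ids, expert_bytes=EXPERT_BYTES):
    """Read only the selected experts' byte ranges from the mapped file.
    Untouched experts never leave disk; the page cache keeps hot ones warm."""
    out = {}
    for e in expert_ids:
        off = e * expert_bytes
        out[e] = np.frombuffer(mm, dtype=np.float32,
                               count=expert_bytes // 4, offset=off)
    return out

# Demo: write 16 fake experts to a file, map it, stream the top-2.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for e in range(16):
        f.write(np.full(EXPERT_BYTES // 4, float(e), dtype=np.float32).tobytes())
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    top_k = [3, 11]                       # router output for this token (made up)
    experts = load_experts(mm, top_k)
    print(experts[3][0], experts[11][0])  # prints: 3.0 11.0
os.unlink(path)
```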
By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon.
Also tested Qwen 4B on iPhone 13 Pro.
Code and implementation details: https://github.com/SharpAI/SwiftLM
One thing I've turned up in smaller models, and am slowly working toward verifying in larger ones, is that if you train the MoE model from scratch with this kind of knockout / subset of experts baked in, you get significantly better loss outcomes. In small models, it's actually better than training an MoE without conditioning on a reduced set of experts per pass.
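One way to read the knockout idea: during training, randomly restrict the router to a subset of experts each step, so the model learns to route well under the same reduced-expert regime it will see at inference. A toy routing-mask sketch (names and shapes are my own, not the commenter's code):

```python
import numpy as np

def knockout_mask(n_experts, keep, rng):
    """Sample a random subset of experts allowed for this training step."""
    alive = rng.choice(n_experts, size=keep, replace=False)
    mask = np.full(n_experts, -np.inf)
    mask[alive] = 0.0
    return mask  # added to router logits before softmax/top-k

rng = np.random.default_rng(0)
logits = rng.normal(size=64)       # router logits for 64 experts
mask = knockout_mask(64, keep=16, rng=rng)
masked = logits + mask             # knocked-out experts get -inf
top2 = np.argsort(masked)[-2:]     # top-k routing among surviving experts only
assert np.all(np.isfinite(masked[top2]))
```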
Anyway, pretty cool. There's some Pareto-optimal curve based on memory bandwidth, amount of GPU / unified RAM and inference compute times for streaming stuff in.