10 comments

  • Aurornis 3 hours ago
    Although I'm interested in both topics (KV compression and attempts to stream MoE models from storage) this is at least the 10th vibecoded project on this topic I've seen today alone across HN, Twitter, and some subreddits I visit.

    At least this one gave credit to the upstream projects which it used as a reference.

    The llama.cpp project is also getting a wave of vibecoded PRs that are very clearly being produced by pointing claude at the repo and the original paper and having it produce something.

    Almost none of these attempts contain information that really matters, like actual benchmark tests with different KV quantization levels (not just perplexity or KLD).

    • zozbot234 1 hour ago
      The performance gain in the recent Flash-MoE implementations is seemingly obtained mostly by coalescing the data for each single MoE layer-expert into a single sequential extent which can be read efficiently from SSD. If so, this will actually require some changes in the underlying GGUF format; though the GGUF standard provides explicitly for specifying different data layouts, so the additions are arguably minor.
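
      The coalescing idea can be sketched roughly like this (the file layout and helper names below are made up for illustration; the actual GGUF additions would differ):

```python
import os
import struct
import tempfile

# Toy illustration of coalescing: store each MoE layer-expert's weights as one
# contiguous extent, so one seek plus one sequential read fetches a whole
# expert from SSD, instead of gathering many small tensors scattered through
# the file.
def write_coalesced(path, experts):
    """experts: one list of float32 weights per expert."""
    offsets = {}
    with open(path, "wb") as f:
        for i, weights in enumerate(experts):
            offsets[i] = (f.tell(), len(weights) * 4)  # (byte offset, byte length)
            f.write(struct.pack(f"{len(weights)}f", *weights))
    return offsets

def read_expert(path, offsets, expert_id):
    start, length = offsets[expert_id]
    with open(path, "rb") as f:
        f.seek(start)         # one seek...
        raw = f.read(length)  # ...one sequential read per expert
    return list(struct.unpack(f"{length // 4}f", raw))

path = os.path.join(tempfile.mkdtemp(), "experts.bin")
offsets = write_coalesced(path, [[1.0, 2.0], [3.0, 4.0, 5.0]])
print(read_expert(path, offsets, 1))  # [3.0, 4.0, 5.0]
```

      In a real implementation the offset table would live in the file's metadata rather than in memory, which is where the GGUF format change would come in.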

      As far as the TurboQuant thing goes, it seems that attn-rot has recently been merged in, which is a lightweight variety of it and written by the original llama.cpp author, so not an outside pull req.

      • Aurornis 28 minutes ago
        > As far as the TurboQuant thing goes, it seems that attn-rot has recently been merged in, which is a lightweight variety of it and written by the original llama.cpp author, so not an outside pull req.

        Yes, read the first sentence of the PR for it. The project is a constant target for vibecoded PRs and they're trying to stay in front of it:

        > In anticipation of the incoming flood of vibe generated PRs implementing TurboQuant, I'm raising the baseline a bit

    • _zoltan_ 3 hours ago
      "vibe coded" is NOT the bad thing you think it is.

      Going from paper to implementation from scratch in half an hour or so is great.

      • mjr00 2 hours ago
        > "vibe coded" is NOT the bad thing you think it is.

        It's not inherently bad in the same way that a first draft of a novel is not inherently bad.

        But if someone asked me to read their novel and it was a first draft that they themselves had clearly not bothered reading or editing, I'd tell them to fuck off.

        • sumeno 2 hours ago
          At least in the novel example the author had the decency to write what they're asking you to read.

          These are more like sending someone a LMGTFY link to a question they never asked and expecting them to read all the results. Just a complete lack of awareness and respect for the maintainers.

      • simonw 2 hours ago
        Sure, but the problem is when you take that half hour of work and share it with other people without making clear how much effort has gone into it.

        Software is valuable if it has been tested and exercised properly by other people. I don't care if you vibe coded it, provided you then put in the real work to verify that it actually works correctly, and then include the proof that you've done that when you start widely sharing it with the world.

        Right now it's impossible to tell which of these projects implementing the paper are worth spending time with.

        • kristjansson 1 hour ago
          > without making clear how much effort has gone into it

          I'm increasingly convinced this is the critical context for sharing LLM outputs with other people. The robots can inflate any old thought into dozens of pages of docs, thousands of lines of MR. That might be great! But it completely severs the connection between the form of a work and the author's assessment/investment/attachment/belief in it. That's something one's audience might like to know!

        • dalemhurley 1 hour ago
          Isn't the point of an MVP to be an MVP?

          The OP put together a POC and shared it, showing novel concepts used together. They are not some large R&D lab.

          The purity tests being asked for are in contradiction to the Show HN guidelines.

          • aegis_camera 3 minutes ago
            Thanks, we are not a large R&D lab and have limited resources. We were working on a product that is a local-VLM-first, BYOD video security application. Our users requested an MLX backend benchmark comparison, and we tried hard not to ship Python in the application bundle, so we searched for a pure-binary MLX implementation; the results showed we needed to build one. It took us two weeks to get it working, and we have been testing with multiple models. As a reference, you can see the results here: https://www.sharpai.org/benchmark/

            Then we saw the announcement from Google about TurboQuant; it's so cool, so we started to integrate it (along with SSD/Flash caching). It's a nontrivial process, and thanks for your support and understanding.

          • Aurornis 42 minutes ago
            > The OP put together a POC and shared it, showing novel concepts used together.

            That's the contention: There are countless POCs for these concepts already, and some of them were used as the basis for this project.

            It's not really a novel POC; it's the result of putting the previous work into Claude Code and telling it to rewrite it in Swift, then putting your name on it. To be fair, the person did start adding the reference projects to the very end of the README.

            But if you didn't know what to look for, you'd assume this was a very novel project attributable to their own work.

          • simonw 1 hour ago
            This post wasn't marked as a Show HN.
            • aegis_camera 10 minutes ago
              Tried, but wrong time to post, it got zero attention . :)
          • th0ma5 1 hour ago
            [dead]
      • Aurornis 2 hours ago
        > Going from paper to implementation from scratch in half an hour or so is great.

        This repo isn’t showing that at all. Scroll to the bottom of the README and you’ll see the other project it was based on. It’s a translation of other people’s work.

        There have been dozens or perhaps hundreds of vibecoded TurboQuant examples posted around the usual forums in the past few days. This one doesn’t even include anything helpful like benchmarks or tests. It’s just some proof of concept code that doesn’t even work if you try to run it.

        My problem with this specific type of vibe coded project is that it’s initially presented as something more novel or polished in order to get more upvotes, karma, likes, or pad a resume. Then you read it and discover they just pointed Claude at some other projects and told it to produce something similar, then posted it as their own work.

      • brokencode 2 hours ago
        That’s a starting spot, but how about some testing and benchmarks?

        Where’s the value added if the person just tells Claude to do it and then submits a PR?

        The maintainers may as well vibe code it themselves if that’s all the work the would-be contributor is going to put into it.

        • yieldcrv 2 hours ago
          if it works it works

          we live in a wholly unoptimized world because the available resources have been so high, while the benefits of optimizing have been so low. that has flipped now and there are tons of low hanging fruit to optimize.

          I agree that benchmarks would be great, but that's only relevant to this one topic, not to the overall agentic-coded pull request concept itself.

          • jmalicki 2 hours ago
            It's relevant in that it's an example that people are doing the easy part - the coding - and skipping the hard part - the benchmarking and proving it works and provides value.

            A PR without evidence that it works, or any indication of the benefits the new feature would bring, is kind of worthless.

          • pqtyw 2 hours ago
            It might work, but what's the point in sharing it if anyone can do the same in those 30 minutes with minimal effort?
          • sumeno 2 hours ago
            > if it works it works

            If it works in one case that doesn't mean it works consistently or well in the general case

            I've made lots of things with Claude Code that just work... until I do things in a slightly different order and the whole thing explodes

      • sroussey 2 hours ago
        The authors of the project have Claude Code as well, so doing this is just eating their time.
      • pqtyw 2 hours ago
        If it contributes nothing valuable, though? I.e., if it's not a novel paper, then the only value is whatever you personally learn from it.
  • robotswantdata 2 hours ago
    Feels 100% vibe coded in a bad way.

    Llama.cpp already has KV compression and one of the turbo quant PRs will get merged at some point.

    If you don’t care about the fancy 3-bit, the q8 KV compression is good enough! Don’t bother with q4.

      ./build/bin/llama-server -m model.gguf \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        -c 65536

    Etc

    • aegis_camera 1 hour ago
      One of my users requested an MLX comparison with GGUF; he wanted to run the benchmark. I was thinking about how to get MLX support without bundling the Python code together with SharpAI Aegis, a local or BYOK security agent (https://www.sharpai.org). Then I had to pick up Swift and create it.

      The benchmark shows a benefit of the MLX engine, so it's the user's choice which engine to use; aegis-ai supports both : )

  • aegis_camera 4 hours ago
    We implemented two techniques to run massive 100B+ parameter MoE models natively on the M5 Pro 64GB MacBook Pro:

    TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime, completely eliminating Python overhead.
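
    The codebook idea can be sketched like so (the codebook values here are made up; the paper's Lloyd-Max codebooks are trained to minimize distortion, and the real dequantization lookup is the part fused into the Metal shaders):

```python
# Toy nearest-codeword quantizer illustrating codebook-based KV-cache
# compression. Each cached value is replaced by the index of its nearest
# codeword, so a 2-bit index stands in for a 16/32-bit float.
CODEBOOK = [-1.5, -0.5, 0.5, 1.5]  # 4 entries -> 2 bits per value

def quantize(x):
    # Store only the index of the nearest codeword.
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - x))

def dequantize(idx):
    # At attention time, indices are expanded back to codeword values.
    return CODEBOOK[idx]

values = [0.4, -1.2, 1.9]
codes = [quantize(v) for v in values]
print(codes)                           # [2, 0, 3]
print([dequantize(c) for c in codes])  # [0.5, -1.5, 1.5]
```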

    SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or Watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.

    By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon.
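
    The streaming side can be sketched with a memory map (a toy illustration, not the actual SwiftLM code; sizes and expert routing are made up):

```python
import mmap
import os
import struct
import tempfile

EXPERT_FLOATS = 4  # floats per expert in this toy file

# Build a toy weight file with 8 "experts", each a contiguous block.
path = os.path.join(tempfile.mkdtemp(), "moe.bin")
with open(path, "wb") as f:
    for e in range(8):
        f.write(struct.pack(f"{EXPERT_FLOATS}f", *[float(e)] * EXPERT_FLOATS))

# The file stays on disk; with mmap, only the pages we touch get faulted in,
# and the OS page cache keeps hot experts resident automatically.
f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def load_experts(top_k_ids):
    """Read only the top-k routed experts for this forward pass."""
    out = {}
    for e in top_k_ids:
        off = e * EXPERT_FLOATS * 4
        out[e] = struct.unpack(f"{EXPERT_FLOATS}f", mm[off:off + EXPERT_FLOATS * 4])
    return out

print(load_experts([2, 5]))  # only experts 2 and 5 are pulled from storage
```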

    Also tested Qwen 4B on an iPhone 13 Pro.

    Code and implementation details: https://github.com/SharpAI/SwiftLM

    • anemll 2 hours ago
      • aegis_camera 1 hour ago
        Thanks. Pure Swift was the design idea, and since I found nothing that could be used for my project (https://www.sharpai.org), I created a Swift version. Python is too heavy to deliver with the application, and a user mentioned they wanted to use MLX, which is why I've been working on it for 1-2 weeks of bug fixing and testing; then suddenly TurboQuant was proposed, and I did a quick integration. My 64GB M5 Pro is already good for my local security task, and now it's able to use an M1/M2 Mini w/ 8GB memory.
    • altruios 3 hours ago
      what tokens/s are you getting with a 122B MoE model in this setup? I didn't see any benchmarks in the benchmarks section on the readme.md
      • aegis_camera 1 hour ago
        https://www.sharpai.org/benchmark/ The MLX part is what we've done with SwiftLM; the local results are still being verified, and more details are on the way.
      • aegis_camera 2 hours ago
        I'll add more details. We just wired up the pipeline on both macOS and iOS.
      • gigatexal 2 hours ago
        yeah this I'd like to see added to the readme.
  • simonw 2 hours ago
    I couldn't get the downloadable binary to work, or the binary I compiled myself:

      ./SwiftLM \
        --model mlx-community/Qwen3.5-122B-A10B-4bit \
        --stream-experts \
        --port 5413
    
    Error:

      [SwiftLM] Loading model: mlx-community/Qwen3.5-122B-A10B-4bit
      [SwiftLM] Enabled Async SSD Streaming on directory: e9c67b08899964be5fdd069bb1b4bc8907fe68f5
      [SwiftLM]  Memory strategy: FULL GPU (69.6GB model, 133.4GB available)
      [SwiftLM] Download: [===================>] 100% ⠋ (66395.4 MB / 66395.4 MB) | Speed: 0.0 MB/s      
      MLX error: Failed to load the default metallib. library not found library not found library not found library not found  at /Users/runner/work/SwiftLM/SwiftLM/LocalPackages/mlx-swift/Source/Cmlx/mlx-c/mlx/c/stream.cpp:115
    • simonw 2 hours ago
      Claude Code helped me figure out this recipe (inspired by a similar workaround in the CI scripts):

        git clone --recursive https://github.com/SharpAI/SwiftLM.git
        cd SwiftLM
      
        swift build -c release
      
        # Trick to copy in that missing mlx.metallib file
        uv run --with mlx-metal python -c "
        import importlib.metadata, pathlib, shutil
        d = importlib.metadata.distribution('mlx-metal')
        metallib = pathlib.Path(d._path).parent / 'mlx/lib/mlx.metallib'
        shutil.copy(metallib, '.build/release/')
        print(f'Copied {metallib} -> .build/release/mlx.metallib')
        "
      
        # Now start the server (downloads 69GB Qwen model)
        .build/release/SwiftLM \
          --model mlx-community/Qwen3.5-122B-A10B-4bit \
          --stream-experts \
          --port 5413
      
      But the server crashed when I tried to run a prompt through it:

        freed pointer was not the last allocation
      • aegis_camera 1 hour ago
        the Python mlx-metal trick is actually what's crashing it. The mlx.metallib from pip is a different version of MLX than what your Swift binary was built against. It gets past the startup error but then corrupts the GPU memory allocator at inference time → freed pointer was not the last allocation.

        Use the version-matched metallib that's already in the repo:

          cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
            .build/release/

          .build/release/SwiftLM \
            --model mlx-community/Qwen3.5-122B-A10B-4bit \
            --stream-experts \
            --port 5413

        This is the exact metallib that was compiled alongside the Swift code, so there's no version mismatch. Future pre-built releases will bundle it automatically.

    • aegis_camera 2 hours ago
      Please let me know if this fixes the issue:

        git clone https://github.com/SharpAI/SwiftLM   # no --recursive needed
        cd SwiftLM
        swift build -c release

        # Copy metallib next to the binary (one-time step)
        cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
          .build/release/

  • vessenes 3 hours ago
    I like this idea on expert streaming. I've been poking around fairly thoroughly at the same idea - can we fix a set of experts? when can we fix them? How long is the top-k selection "good" for in terms of number of forward passes?

    One thing I've turned up in smaller models, and am sort of winding my way toward verifying in larger ones, is that if you train the MoE model from scratch with this kind of knockout / subset of experts baked in, you get significantly better loss outcomes. In small models, it's actually better than training an MoE without conditioning on a reduced set of experts per pass.

    Anyway, pretty cool. There's some Pareto-optimal curve based on memory bandwidth, amount of GPU / unified RAM and inference compute times for streaming stuff in.

  • gervwyk 2 hours ago
    Anyone else looking at these developments and thinking that local LLMs are the future? So many advantages over remote, and the hardware is just not there yet, but another leap like Apple Silicon and the tech is there.

    Of course large corps will have fancy proprietary models, but for everyday queries and tasks, local feels huge, and just slightly out of reach.

    Am I missing something fundamental?

    • daft_pink 1 hour ago
      I’ve always believed local is the future. Consider how your iPhone has a processor more powerful than machines that were very large not too long ago.
      • aegis_camera 1 hour ago
        I've run this on an iPhone 13 Pro (6GB memory); Qwen3 1.7B runs well. So local will soon be intelligent enough for the tasks you want done, or already is.
    • cl0ckt0wer 2 hours ago
      LLM intelligence seems to be proportional to the RAM used. All techniques like this will be used by everyone.
      • zozbot234 1 hour ago
        You can almost always use less RAM by making inference slower. Streaming MoE active weights from SSD is an especially effective variety of this, but even with a large dense model, you could run inference on a layer-wise basis (perhaps coalescing only a few layers at a time) if the model on its own is too large for your RAM. You need to store the KV-cache, but that takes only modest space and at least for ordinary transformers (no linear attention tricks) is append-only, which fits well with writing it to SSD (AIUI, this is also how "cached" prompts/conversations work under the hood).
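
        The append-only property can be illustrated like this (a made-up on-disk format, not any real KV-cache layout):

```python
import os
import struct
import tempfile

class DiskKVCache:
    """Toy on-SSD KV cache: entries are only ever appended, never rewritten,
    so all writes are sequential; random access happens only on reads."""

    def __init__(self, path, head_dim):
        self.f = open(path, "ab+")
        self.head_dim = head_dim

    def append(self, kv):
        # One sequential write per new token's entry.
        self.f.write(struct.pack(f"{self.head_dim}f", *kv))

    def read(self, token_idx):
        # Attention-time read of an earlier token's entry.
        self.f.seek(token_idx * self.head_dim * 4)
        return struct.unpack(f"{self.head_dim}f", self.f.read(self.head_dim * 4))

cache = DiskKVCache(os.path.join(tempfile.mkdtemp(), "kv.bin"), head_dim=2)
cache.append([1.0, 2.0])
cache.append([3.0, 4.0])
print(cache.read(1))  # (3.0, 4.0)
```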
  • boogerlad 3 hours ago
    Does this use anything from the flash-moe project?

    https://github.com/Alexintosh/flash-moe

  • daft_pink 1 hour ago
    Can this work on M1, M2, M3, M4?
    • aegis_camera 1 hour ago
      Yes, I've run it on iOS (iPhone 13 Pro) as well as the M5 Pro. I'll test it on my M2 Mini and M3 Air.
  • xiphias2 2 hours ago
    Another project without running real benchmarks. It's very easy to generate tokens, it's much harder to solve tasks locally.
  • dalemhurley 1 hour ago
    [dead]