Mamba-3

257 points - last Tuesday at 10:45 PM

Comments

nl today at 6:22 AM
I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.
roger_ today at 5:58 PM
Can anyone explain why Mamba models start with a continuous time SSM (and discretize) vs discrete time?

I know the step size isn't fixed, though I'm not sure why that matters. Is that the only reason? There also seems to be a parameterization advantage with the continuous formulation.
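For anyone curious about where the variable step size enters: in the Mamba line of work the continuous parameters are mapped to discrete ones via a zero-order hold, with a per-token Δ. A minimal numpy sketch (toy sizes and random values, purely illustrative) of that discretization for a diagonal A:

```python
import numpy as np

# Continuous-time SSM: x'(t) = A x(t) + B u(t)
# Zero-order hold (ZOH) discretization:
#   A_bar = exp(delta * A)
#   B_bar = (exp(delta * A) - I) A^{-1} B
# With a diagonal A this is elementwise, and delta can differ per token,
# which is how the "step size" becomes input-dependent.

rng = np.random.default_rng(0)
N = 4                                  # toy state size
A = -np.abs(rng.standard_normal(N))    # diagonal, negative for stability
B = rng.standard_normal(N)

def discretize(A, B, delta):
    """ZOH discretization for a diagonal A; delta is the per-token step."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

# A larger delta decays the state more and integrates the input longer:
A1, B1 = discretize(A, B, delta=0.1)
A2, B2 = discretize(A, B, delta=1.0)
```

Because Δ is produced from the input, the model effectively chooses how much to "forget" per token, which is one thing a fixed discrete-time recurrence can't express as directly.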

Havoc today at 1:27 PM
Is there a reason we don't switch halfway through? I.e., start with a classic LLM and switch to something linear like Mamba as the context grows.
jychang today at 9:44 AM
I'm not sure that I buy their conclusion that more compute during inference is good.

Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.

With a fused kernel, that means the GPU streams the tensors from VRAM once and does the compute for the different conversations in the batch at the same time.

If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this means each GPU can serve fewer users. Providers aren't normally leaving GPU cores idle during inference.
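The tradeoff being described can be made concrete with a back-of-envelope calculation. The sketch below (all hardware numbers are hypothetical round figures, not from the paper) estimates the batch size at which decoding flips from bandwidth-bound to compute-bound, and how that critical batch shrinks if per-token compute goes up:

```python
# Back-of-envelope roofline for batched decoding. Illustrative numbers only.
flops_peak = 1.0e15        # GPU peak throughput, FLOP/s (hypothetical)
bandwidth  = 3.0e12        # HBM bandwidth, bytes/s (hypothetical)
bytes_per_step = 2 * 7e9   # weights streamed per decode step (7B params, fp16)

def critical_batch(flops_per_token):
    """Batch size where total compute time catches up with weight-streaming time.

    Streaming the weights is a fixed cost per decode step; compute scales
    with the number of sequences in the batch. Below this batch size the
    GPU is bandwidth-bound, above it compute-bound.
    """
    stream_time = bytes_per_step / bandwidth        # fixed per step
    compute_time = flops_per_token / flops_peak     # per sequence in batch
    return stream_time / compute_time

base  = critical_batch(flops_per_token=2 * 7e9)   # ~2 FLOPs per weight
heavy = critical_batch(flops_per_token=4 * 7e9)   # double per-token compute
```

Doubling the per-token FLOPs halves the batch size that can hide behind the memory traffic, which is exactly the "extra inference compute isn't free once you batch" point above.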

fudged71 today at 6:02 PM
This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?
jeffhwang today at 3:56 PM
I'm glad I clicked through, because I thought the article was about Mamba, the package manager I associate with Python (similar to conda).

https://github.com/mamba-org/mamba

manlymuppet today at 5:56 PM
I'm looking forward to the fifth iteration of this model.
robofanatic today at 6:09 AM
> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.

Why can’t they simply say -

Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.