I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.
roger_ - today at 5:58 PM
Can anyone explain why Mamba models start with a continuous-time SSM (and then discretize) rather than working in discrete time from the start?
I know the step size isn't fixed, but I'm not sure why that matters. Is that the only reason? There also seems to be a parameterization advantage to the continuous formulation.
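For context on what "the step isn't fixed" buys you: here's a minimal sketch of zero-order-hold (ZOH) discretization for a diagonal continuous-time SSM. Names and shapes are illustrative, not Mamba's actual code; the point is that a per-token step size dt turns one continuous (A, B) into a different discrete recurrence at every position.

```python
import numpy as np

# Minimal sketch: ZOH discretization of a diagonal continuous-time SSM
#   x'(t) = A x(t) + B u(t)
# with an input-dependent step size dt. Illustrative, not Mamba's code.

def discretize_zoh(A, B, dt):
    """A: (d,) diagonal continuous-time decay, B: (d,) input map, dt: scalar."""
    A_bar = np.exp(dt * A)            # exact matrix exponential for diagonal A
    B_bar = (A_bar - 1.0) / A * B     # ZOH integral of exp(A t) B over [0, dt]
    return A_bar, B_bar

# The same continuous parameters give a different discrete recurrence
# depending on dt -- a fixed discrete parameterization wouldn't do this.
A = np.array([-1.0, -0.5])
B = np.array([1.0, 1.0])
for dt in (0.1, 1.0):
    A_bar, B_bar = discretize_zoh(A, B, dt)
    print(dt, A_bar, B_bar)
```

One nice side effect of the continuous parameterization: A is constrained to be a stable decay regardless of what dt the network picks, which is harder to guarantee if you learn the discrete transition directly.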
Havoc - today at 1:27 PM
Is there a reason we don't switch halfway through? i.e. start with a classic LLM and switch to something linear like Mamba as the context grows.
jychang - today at 9:44 AM
I'm not sure that I buy their conclusion that more compute during inference is good.
Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.
With a fused kernel, that means the GPU streams the tensors from VRAM, and does a bunch of compute on different conversations in the batch, at the same time.
If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this means each GPU can serve fewer users. Providers aren't normally leaving GPU cores idle during inference.
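Rough roofline math for the batching point above (all hardware numbers are illustrative assumptions, roughly H100-class, not measurements): streaming each fp16 weight once serves the whole batch, so arithmetic intensity grows linearly with batch size, and past a critical batch size the GPU flips from bandwidth-bound to compute-bound.

```python
# Back-of-envelope roofline for batched decoding.
# Illustrative hardware assumptions (roughly H100-class):
PEAK_FLOPS = 990e12  # fp16 FLOP/s
MEM_BW = 3.35e12     # HBM bytes/s

def arithmetic_intensity(batch_size, bytes_per_param=2):
    # Each fp16 weight streamed once from VRAM does 2 FLOPs
    # (multiply-add) per sequence in the batch.
    return 2 * batch_size / bytes_per_param  # FLOPs per byte

# Intensity at which compute time equals memory-transfer time.
critical = PEAK_FLOPS / MEM_BW

for b in (1, 64, 512):
    ai = arithmetic_intensity(b)
    bound = "compute" if ai > critical else "bandwidth"
    print(f"batch={b}: {ai:.0f} FLOP/B -> {bound}-bound")
```

Under these assumptions the crossover sits around a batch of ~300, so batch=1 is deep in bandwidth-bound territory, but large production batches are already spending the "idle" compute, which is why extra per-token FLOPs aren't free.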
fudged71 - today at 6:02 PM
This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?
jeffhwang - today at 3:56 PM
I'm glad I clicked through, because I thought the article was about Mamba, the package manager I associate with Python (similar to conda).
I'm looking forward to the fifth iteration of this model.
robofanatic - today at 6:09 AM
> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.
Why can't they simply say:
Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.