MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU

222 points - today at 12:19 PM

internetguy today at 12:51 PM
> MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state

This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle to train larger models (>40M-50M params): I get OOM errors and have to optimize a lot.

I have a lot more CPU RAM in my PC, and this would likely increase the size of models I can train locally.
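The streaming loop described in the quoted abstract can be sketched in a few lines of NumPy. Everything here is my own illustration, not the paper's code: `host_params` stands in for pinned host memory, the per-layer `.copy()` stands in for a host-to-device transfer, and the real system would overlap those transfers with compute on separate CUDA streams.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Host memory": every layer's weights live here permanently.
host_params = [rng.standard_normal((32, 32)) * 0.2 for _ in range(4)]
last = len(host_params) - 1

def stream_train_step(x, target, lr=1e-2):
    """One training step with layer-by-layer parameter streaming (sketch)."""
    # Forward: stream each layer into a transient "device" buffer.
    acts = [x]
    for i, w_host in enumerate(host_params):
        w_dev = w_host.copy()                 # host -> device transfer
        z = acts[-1] @ w_dev
        acts.append(np.maximum(z, 0.0) if i < last else z)  # ReLU except last
        del w_dev                             # device buffer is freed

    # Backward: stream layers in reverse; gradients never persist on device.
    grad = 2.0 * (acts[-1] - target) / target.size   # dL/dy for MSE loss
    for i in reversed(range(len(host_params))):
        w_dev = host_params[i].copy()         # host -> device transfer
        if i < last:
            grad = grad * (acts[i + 1] > 0)   # backprop through ReLU
        w_grad = acts[i].T @ grad             # this layer's weight gradient
        grad = grad @ w_dev.T                 # gradient w.r.t. layer input
        host_params[i] -= lr * w_grad         # update written straight back to host
    return float(np.mean((acts[-1] - target) ** 2))

x = rng.standard_normal((8, 32))
target = rng.standard_normal((8, 32))
losses = [stream_train_step(x, target) for _ in range(50)]
```

The key property is that peak "device" memory is one layer's weights plus the activations, independent of model depth; the price is re-transferring every layer on every step, which is why overlap with compute matters so much in practice.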

kouteiheika today at 2:37 PM
This isn't really anything new; I've been doing something like this for quite a while, I just never bothered writing a paper. (: Probably anyone who would seriously tackle the problem of "how do I train a huge model on a tiny amount of VRAM?" would come up with something similar.

However, most people in the field don't, because the practical utility of training huge models on a single GPU is quite low. (E.g. they got 341 tok/s for a 14B model on a single 3090, while with my method I was getting ~1k tok/s on a single 4090; that's still very slow.)

Also, there are more tricks one can use to speed up training and lower VRAM usage which they're not using. For example, you don't need any gradient offloading (you can accumulate the gradients directly into the optimizer's state if you modify your optimizer), you can use Muon instead of Adam (which needs only half the VRAM of Adam), you can use quantization (both for the parameters and for the optimizer states; e.g. I found Muon quantized to 4-bit works relatively well), etc.
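One of the tricks above, folding gradient accumulation into the optimizer state, is easy to illustrate for SGD with momentum (a toy construction of mine, not the commenter's actual code): instead of keeping a separate accumulation buffer across micro-batches, decay the momentum once and add each micro-batch gradient straight into it.

```python
import numpy as np

rng = np.random.default_rng(1)
K, beta, lr = 4, 0.9, 0.1           # micro-batches per step, momentum, LR

w_std = rng.standard_normal(16)     # path A: standard grad accumulation
m_std = np.zeros(16)
w_fused = w_std.copy()              # path B: grads folded into momentum
m_fused = np.zeros(16)

for _ in range(3):                  # three optimizer steps
    micro_grads = [rng.standard_normal(16) for _ in range(K)]

    # Standard: dedicated accumulation buffer, then momentum update.
    acc = np.zeros(16)
    for g in micro_grads:
        acc += g / K
    m_std = beta * m_std + acc
    w_std -= lr * m_std

    # Fused: decay momentum once, add micro-grads straight into it.
    m_fused *= beta                 # no separate gradient buffer exists
    for g in micro_grads:
        m_fused += g / K
    w_fused -= lr * m_fused

print(np.allclose(w_std, w_fused))  # True: same update, one less buffer
```

Both paths produce the same weights, but the fused one never allocates a full-size gradient buffer, which is exactly the memory saving being described.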

bilekas today at 3:25 PM
> H200 GPU with 1.5TB host memory,

While yes, it's one GPU... it's not exactly a slim one.

drob518 today at 4:08 PM
I’m curious how this technique works, or doesn't, with unified memory architectures such as Apple’s M series. It seems to rely on overlapping transfers with compute to speed things up, but I’d assume that having everything unified in main memory, so you don’t have to shuttle data back and forth to the GPU, would also have some advantages. Can someone wiser explain this to me?

ilaksh today at 2:21 PM
How long would it actually take to train a 120B model on an H200? What if you have 8?

WithinReason today at 1:31 PM
I was wondering how well this would work :) You can definitely push this further; the question is: how well can the gradients and updates compress?

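As a rough feel for "how well can the gradients compress": here's a minimal sketch (my own construction, with a made-up group size) of symmetric per-group 4-bit quantization applied to a Gaussian tensor, the kind of scheme the sibling comment reports using for optimizer states.

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize_4bit(x, group=64):
    """Symmetric per-group 4-bit quantization (illustrative sketch)."""
    x = x.reshape(-1, group)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0  # int4 range [-7, 7]
    scale[scale == 0] = 1.0                 # avoid divide-by-zero on zero groups
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

grads = rng.standard_normal(4096)           # stand-in for a gradient tensor
q, s = quantize_4bit(grads)
rel_err = np.linalg.norm(dequantize(q, s) - grads) / np.linalg.norm(grads)
print(f"relative error: {rel_err:.3f}")
```

With one fp32 scale per 64 values this costs about 4.5 bits per value instead of 32; whether roughly 10% relative error per step is tolerable depends on where it's applied (momentum-like state tends to be more forgiving than the parameters themselves).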
1aurent29 today at 1:58 PM
Sounds very similar to https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_... I wonder how much of this could be replicated using only this PyTorch primitive.

olliepro today at 1:01 PM
This would likely only get used for small finetuning jobs. It’s too slow for the scale of pretraining.

ur-whale today at 5:56 PM
Why is it that no one ever talks about the one thing nobody can get their hands on except the big labs?

I'm talking about the training set.

Sure, there are some open datasets out there.

But my guess is they are nowhere near what OpenAI, Google and Anthropic are actually using.

Happy to be proven wrong.

atlgator today at 2:52 PM
The GPU is no longer the brain, it's the hand. The brain is your RAM. Suddenly that 256GB DDR5 build your wife questioned is 'research infrastructure.'

l1n today at 1:24 PM
Seems similar to Microsoft DeepSpeed.