Async/Await on the GPU

139 points - today at 4:53 PM

Comments

lmeyerov today at 11:34 PM

We have been curious about this for rapids cudf. We built the graphistry stack 10 years ago (!) for GPU native end-to-end acceleration - data loading, wrangling, analytics, enrichment, viz, etc, all the way server GPUs to client GPUs - but this has been a huge sticking point. Major constant overheads for smaller workloads that seem avoidable.

Essentially, we solved the problem of writing our stack in a bulk-oriented way that Nvidia kernels can optimize. Think apache arrow, pure vectorized dataframe pipelines, etc. However, cudf is 'eager' with per-step CPU/GPU control plane coordination, even if the data plane lives on the GPU. Polars in theory moves to lazy scheduling that can allow deforesting optimizations for more bulk GPU-side control macro steps, but not really. Nvidia efforts to cut python asyncio costs for multitenant etc flows didn't pan out either. So enabling moving more to the GPU here is super interesting.

Will be watching!

Arifcodes today at 9:02 PM

The interesting challenge with async/await on GPU is that it inverts the usual concurrency mental model. CPU async is about waiting efficiently while I/O completes. GPU async is about managing work distribution across warps that are physically executing in parallel. The futures abstraction maps onto that, but the semantics are different enough that you have to be careful not to carry over intuitions from tokio/async-std.

The comparison to NVIDIA's stdexec is worth looking at. stdexec uses a sender/receiver model which is more explicit about the execution context. Rust's Future trait abstracts over that, which is ergonomic but means you're relying on the executor to do the right thing with GPU-specific scheduling constraints.

Practically, the biggest win here is probably for the cases shayonj mentioned: mixed compute/memory pipelines where you want one warp loading while another computes. That's exactly where the warp specialization boilerplate becomes painful. If async/await can express that cleanly without runtime overhead, that is a real improvement.

xiphias2 today at 7:19 PM

Really cool experiment (the whole company).

Training pipelines are full of data preparation that are first written on CPU then moving to GPU and always thinking of what to keep on CPU and what to put on GPU, when is it worth to create a tensor, or should it be tiling instead. I guess your company is betting on solving problems like this (and async-await is needed for serving inference requests directly on the GPU for example).

My question is a little bit different: how do you want to handle the SIMD question: should a rust function be running on the warp as a machine with 32 long arrays as data types, or always ,,hope'' for autovectorization to work (especially with Rust's iter library helpers).

zozbot234 today at 5:49 PM

I'm not quite seeing the real benefit of this. Is the idea that warps will now be able to do work-stealing and continuation-stealing when running heterogenous parallel workloads? But that requires keeping the async function's state in GPU-wide shared memory, which is generally a scarce resource.

starkiller today at 10:39 PM

Love this.

You mention futures are cooperative and GPUs lack interrupts, but GPU warps already have a hardware scheduler that preempts at the instruction level. ARe you intentionally working above that layer, or do you see a path to a fture executor that hooks into warp scheduling more directly to get preemptive-like behavior?

GZGavinZhao today at 7:53 PM

One concern I have is that this async/await approach is not "AOT"-enough like the Triton approach, in the sense that you know how to most efficiently schedule the computations on which warps since you know exactly what operations you'll be performing at compile time.

Here with the async/await approach, it seems like there needs to be manual book-keeping at runtime to know what has finished, what has not, and _then_ consider which warp should we put this new computation in. Do you anticipate that there will be measurable performance difference?

textlapse today at 6:03 PM

What's the performance like? What would the benefits be of converting a streaming multiprocessor programming model to this?

shayonj today at 5:35 PM

Very cool to see this and something I have been curious about myself and exploring the space as well. I'd be curious what are some parallels and differentiations between this and NVIDIA's stdexec (outside of it being in Rust and using Future, which is also cooL)

ismailmaj today at 8:44 PM

Warp specialization is an abomination that should be killed and I'm glad this could be an alternative.

I hope they can minimize the bookkeeping costs because I don't see it gain traction in AI if it hurts big kernels performance.

Arch485 today at 6:39 PM

Very cool!

Is the goal with this project (generally, not specifically async) to have an equivalent to e.g. CUDA, but in Rust? Or is there another intended use-case that I'm missing?

the__alchemist today at 7:10 PM

Et tu, GPU?

I am, bluntly, sick of Async taking over rust ecosystems. Embedded and web/HTTP have already fallen. I'm optimistic this won't take hold in GPU; well see. Async splits the ecosystem. I see it as the biggest threat to Rust staying a useful tool.

I use rust on the GPU for the following: 3d graphics via WGPU, cuFFT via FFI, custom kernels via Cudarc, and ML via Burn and Candle. Thankfully these are all Async-free.

firefly2000 today at 6:06 PM

Is this Nvidia-only or does it work on other architectures?

bionhoward today at 8:21 PM

genius, great idea and follow through, please keep it up, this could improve the ML industry tremendously, maybe some einops inspired interface for this would be good?