Video Encoding and Decoding with Vulkan Compute Shaders in FFmpeg

126 points - last Tuesday at 1:02 AM

Comments

pandaforce today at 2:40 PM

The main target for this is NLEs like Blender. Performance is a large part of the issue. Most users still just create TIFF files per frame before importing them into a "real editor" like Resolve. Apple may have ASICs for ProRes decoding, and Resolve may be the standard editor that everyone uses.

But this goes beyond what even Apple has, by making it possible to work directly with compressed lossless video on consumer GPUs. You can get hundreds of FPS encoding or decoding 4K 16-bit FFv1 on a 4080, while only reading a few gigabits of video per second, rather than the tens or even hundreds of gigabits per second that SSDs can't keep up with. There is also no need to accept image degradation when passing intermediate copies between CG programs and editing.
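As a rough back-of-the-envelope check on those bandwidth figures, the sketch below computes the uncompressed data rate of 4K 16-bit video. The frame size (3840x2160), the three full-resolution components per pixel, and the 60 and 300 fps rates are assumptions chosen for illustration; the comment above does not specify them, and the function name is hypothetical.

```python
def uncompressed_gbit_per_s(width, height, components, bits_per_component, fps):
    """Raw (uncompressed) video data rate in gigabits per second."""
    bits_per_frame = width * height * components * bits_per_component
    return bits_per_frame * fps / 1e9


if __name__ == "__main__":
    # Assumed: UHD frame, 3 components (e.g. RGB or 4:4:4), 16 bits each.
    for fps in (60, 300):
        rate = uncompressed_gbit_per_s(3840, 2160, 3, 16, fps)
        print(f"{fps} fps: {rate:.1f} Gbit/s")
```

Under these assumptions the raw stream is roughly 24 Gbit/s at 60 fps and 119 Gbit/s at 300 fps, which is in line with the "tens or even hundreds of gigabits" mentioned above.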

null-phnix today at 2:54 PM

A lot of the confusion in this thread feels like it comes from thinking in terms of web streaming rather than the workloads this post is targeting.

The article is pretty explicit that this is not about "make Twitch more efficient" or squeezing a bit more perf out of H.264. It is about mezzanine and archival formats that are already way beyond what a single CPU, even a decade old workstation CPU, handles comfortably in real time: 4K/6K/8K+ 16‑bit, FFv1-style lossless, ProRes RAW, huge DPX sequences, etc. People cutting multi‑camera timelines of that kind of material are already on the wrong side of the perf cliff and are often forced into very specific hardware or vendors.

What Vulkan compute buys you here is not "GPUs good, CPUs bad", it is the ability to keep the entire codec pipeline resident on the GPU once the bitstream is there, using the same device that is already doing color, compositing and FX, and to do it in a portable way. FFmpeg’s model is also important: all the hairy parts stay in software (parsing, threading, error handling), and only the hot pixel crunching is offloaded. That makes this much more maintainable than the usual fragile vendor API route and keeps a clean fallback path when hardware is not available.

From a practical angle, this is less about winning a benchmark over a good CPU encoder for 4K H.264, and more about changing what is feasible on commodity hardware: e.g., scrubbing multiple streams of 6K/8K ProRes or FFv1 on a consumer GPU instead of needing a fat workstation or dailies transcoded to lighter proxies. For people doing archival work or high end finishing on a budget, that is a real qualitative change, not just an incremental efficiency tweak.

jokoon today at 3:48 PM

I once asked on #ffmpeg@libera if the GPU could be used to encode h264, and apparently yes, but it's not really worth it compared to CPU.

I don't know much about video compression. Does that mean that a codec like h264 is not parallelizable?

hirako2000 today at 2:21 PM

Vulkan compute shaders make GPU acceleration practical for intensive codecs and formats like FFv1, ProRes RAW, and DPX. Previous hybrid GPU + CPU approaches suffered from round-trip latency. These are full hand-offs to the GPU. A big deal for editing workflows.

kvbev today at 3:18 PM

Could this provide an AV1 decoder for low-power hardware that lacks GPU-accelerated AV1 decoding? For my N4020 laptop, for example.

Maybe a Raspberry Pi 4 too.

positron26 today at 1:25 PM

> Most popular codecs were designed decades ago, when video resolutions were far smaller. As resolutions have exploded, those fixed-size minimum units now represent a much smaller fraction of a frame — which means far more of them can be processed in parallel. Modern GPUs have also gained features enabling cross-invocation communication, opening up further optimization opportunities.
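To make the quoted point about fixed-size units concrete, here is a minimal sketch that counts how many 16x16 blocks (the macroblock size used by H.264) cover a frame at several resolutions. The list of resolutions and the function name are illustrative assumptions. The count is only an upper bound on available parallelism, since real codecs have dependencies between neighbouring blocks.

```python
def block_count(width, height, block=16):
    """Number of block x block units needed to cover a width x height frame."""
    blocks_wide = -(-width // block)   # ceiling division
    blocks_high = -(-height // block)
    return blocks_wide * blocks_high


if __name__ == "__main__":
    # Illustrative resolutions, from CIF up to 8K UHD.
    for name, (w, h) in {
        "CIF 352x288": (352, 288),
        "1080p": (1920, 1080),
        "4K UHD": (3840, 2160),
        "8K UHD": (7680, 4320),
    }.items():
        print(f"{name}: {block_count(w, h)} blocks")
```

A CIF frame has 396 such blocks, while a 4K UHD frame has 32,400, so each block is a far smaller fraction of the frame and many more of them are candidates for parallel processing.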

One only needs to look at GPU driven rendering and ray tracing in shaders to deduce that shader cores and memory subsystems these days have become flexible enough to do work besides lock-step uniform parallelism where the only difference was the thread ID.

Nobody strives for random access memory read patterns, but the universal popularity of buffer device address and descriptor arrays can be taken somewhat as proof that these indirections are no longer the friction for GPU architectures that they were ten years ago.

At the same time, the languages are no longer as restrictive as they once were. People are recording commands on the GPU. This kind of fiddly serial work is an indication that the ergonomics of CPU programming have less of a relative advantage, and that cuts deeply into the tradeoff costs.

sylware today at 12:45 PM

Well, the problem with hardware decoding is that it cannot handle all the variations of data corruption, which results in hardware crashes that are sometimes not recoverable with a soft reset of the hardware block.

It is usually more reasonable to work with software decoders for really complex formats, or only to accelerate some heavy parts of the decoding where data corruption is really easy to deal with or benign, or aim for the middle ground: _SIMPLE_ and _VERY CONSERVATIVE_ compute shaders.

Sometimes, the software cannot even tell that the hardware has actually 'crashed' and is spitting out nonsense data. It gets even worse: on some hardware blocks a hot reset does not actually work, and a power cycle is required... So a 'media player' able to use hardware decoding must always provide a clear and visible 'user button' to let the user switch to full software decoding.

Then, there is the next step of "corruption": some streams out there are "wrong", but this "wrong" will decode fine on some specific decoders and not on others, even though they all follow the same specs.

What a mess.

I hope those compute shaders are not using that abomination GLSL (or the DX one), but are instead SPIR-V shaders generated from plain and simple C code.

doctorpangloss today at 1:38 PM

What is the use case? Okay, ultra low latency streaming. That is good. But if you are sending the frames via some protocol over the network, like WebRTC, they will be touching the CPU anyway. Software encoding of 4K h264 runs in real time on a single thread on 65 W, decade-old CPUs, with low latency. The CPU encoders offer much better quality and are more flexible. So it's very difficult to justify the level of complexity needed for hardware video encoding. There is absolutely no need for it for TV streaming, for example. But people who have no need for it keep being obsessed with it.

IMO vendors should stop reinventing hardware video encoding and instead assign the programmer time to making libwebrtc and libvpx better suit their particular use case.

fhn today at 3:10 PM

This article assumes all GPUs are on a PCIe bus, but some are part of the CPU, so the distance problem is minimal and offloading to the GPU might still be a net positive. "Might" because I haven't tested this.