FFmpeg Devs Boast of Up To 94x Performance Boost After Implementing Handwritten AVX-512 Assembly Code (tomshardware.com)
Anton Shilov reports via Tom's Hardware: FFmpeg is an open-source video decoding project developed by volunteers who contribute to its codebase, fix bugs, and add new features. The project is led by a small group of core developers and maintainers who oversee its direction and ensure that contributions meet certain standards. They coordinate the project's development and release cycles, merging contributions from other developers. This group of developers tried to implement a handwritten AVX-512 assembly code path, something that has rarely been done before, at least in the video industry.
The developers have created an optimized code path using the AVX-512 instruction set to accelerate specific functions within the FFmpeg multimedia processing library. By leveraging AVX-512, they were able to achieve significant performance improvements -- from three to 94 times faster -- compared to standard implementations. AVX-512 enables processing large chunks of data in parallel using 512-bit registers, which can hold up to 16 single-precision or eight double-precision floating-point values and operate on all of them in one instruction. This kind of optimization is ideal for compute-heavy tasks in general, and for video and image processing in particular.
The benchmarking results show that the new handwritten AVX-512 code path performs considerably faster than other implementations, including baseline C code and lower SIMD instruction sets like AVX2 and SSSE3. In some cases, the revamped AVX-512 codepath achieves a speedup of nearly 94 times over the baseline, highlighting the efficiency of hand-optimized assembly code for AVX-512.
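To make the data-parallelism concrete, here is a minimal sketch in C of what an AVX-512 kernel looks like; it is not FFmpeg's actual code, and the function name is illustrative. Each intrinsic operates on 16 packed floats held in one 512-bit register.

```c
#include <stddef.h>
#include <immintrin.h>  /* x86 SIMD intrinsics */

/* Add two float buffers, 16 values per 512-bit register.
 * The target attribute lets this compile without a global -mavx512f flag;
 * callers must still verify CPU support before taking the wide path. */
__attribute__((target("avx512f")))
void add_f32_avx512(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
    for (; i < n; i++)          /* scalar tail for the leftover elements */
        dst[i] = a[i] + b[i];
}
```

Real codec kernels do much more per load (filters, transforms, clipping), which is where the hand-written versions pull ahead of what compilers emit.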
Neat... But, (Score:5, Insightful)
Re: (Score:2)
Re: (Score:2)
DJB is a great programmer who has given much, but I disagree with him here.
I doubt it. Do you think compiler optimizations should cause the compiler to output incorrect, buggy code?
Re:Neat... But, (Score:4, Informative)
It's quite difficult to optimize this sort of stuff at compiler level because the compiler has to infer what the intent is, while still obeying all the rules that define the behaviour of C code.
As a compiler developer you could expend a lot of effort optimizing for this particular scenario, but it wouldn't help many other projects. It's a case where assembler is a good option to get some application specific performance enhancements.
I encountered a similar thing with ffmpeg on ARM some years ago. We needed to do some FFT calculations, but the Microsoft compiler with its intrinsic support was not producing fast enough output, so we used the code from ffmpeg, which had handwritten assembly for ARM NEON. Performance was orders of magnitude better and let us do it in real time. The code was of course published as per GPL requirements.
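A concrete instance of the "intent" problem mentioned above is 8-bit saturating arithmetic, which video pixel math uses constantly. In C it must be spelled out as an add-and-clamp, and the compiler has to recognize that whole pattern before it can emit the single saturating-add SIMD instruction that SSE/AVX and NEON provide. A sketch of the scalar version (the function name is illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* 8-bit unsigned saturating add, a staple of video pixel math.
 * The compiler must prove this compare-and-clamp is a saturating add
 * before it can map the loop to one PADDUSB-style vector instruction. */
void sat_add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned s = a[i] + b[i];
        dst[i] = s > 255 ? 255 : (uint8_t)s;
    }
}
```

Hand-written assembly sidesteps the pattern-matching entirely: the author just writes the saturating instruction.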
Pretty fucking cool (Score:2)
I hope it's a real world benchmark not some contrived situation and propaganda from Intel.
Re:Pretty fucking cool (Score:5, Informative)
propaganda from Intel.
If anything, the propaganda would be from AMD.
Intel disabled AVX-512 on its 12th, 13th, and 14th Generations of Core processors, leaving owners of these CPUs without it. On the other hand, AMD's Ryzen 9000-series CPUs feature a fully enabled AVX-512 FPU, so the owners of these processors can take advantage of the FFmpeg achievement.
Re: (Score:2)
Yes, AVX-512 is just for professionals, didn't you know that?
Buy our Xeon Scalable processors for 4-10x the price if you need your PC to do things like... decode video quickly and efficiently!
No, don't look over at that other company. Look at me. You need professional. quality. chips.
Re: (Score:2)
where is AI? (Score:5, Funny)
Why didn't they just ask ChatGPT to rewrite it? Handwritten? Haven't we been told there's no reason for that?
Spit you out the same garbage, answer 94x faster. (Score:4, Funny)
If you ask ChatGPT to write you 94x faster code, it will spit you out the same garbage, just answer 94x faster.
Re: (Score:2)
Actually it will just write code that uses a function called Encode94xFaster().
Re: (Score:2)
They asked ChatGPT, but it said "Ain't no one got time for hand coding!"
Every boomer programmer just shrugged (Score:5, Insightful)
Re:Every boomer programmer just shrugged (Score:5, Informative)
I stopped coding/optimizing in assembly over 20 years ago, and then only utilized knowledge of it for debugging, cybersecurity, or fun purposes for a few years. Nowadays I have zero use for it other than getting super annoyed that people don't know it.
Re:Every boomer programmer just shrugged (Score:5, Interesting)
Hey punk, get off my ROM!
My last significant use of assembly was late 80s / early 90s. Working on embedded systems with 8-bit microprocessors, tiny boot ROM capacity, and rudimentary or no compiler, there was no choice. Around 2017, I had to take a deep dive into GCC compiled ARM code to characterize an obscure but dramatic failure in a specific embedded situation. This turned out to be incorrect code generated by GCC. Once characterized, it was not difficult to work around, but it required machine level knowledge to locate.
Re:Every boomer programmer just shrugged (Score:5, Insightful)
That, and programmers often use boneheaded algorithms because they don't know any better.
Remember Bubble Sort? If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort. And yet for years, every computer programming textbook taught this algorithm that's useful for basically nothing, and isn't even intuitive. Students normally react with "How does that even work???" But you know that algorithm made its way into more than a few production systems.
Software optimization employs some very specific techniques. Notably, using some kind of profiler to identify where your bottlenecks are, and looking for ways to reduce execution or loop counts, or ways to reduce the time spent in each iteration. There's a whole lot of software, including decoding algorithms, that never went through any kind of proper optimization analysis.
I agree, it's not surprising to find ways to increase performance by 90+%, regardless of the language chosen.
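A classic example of "reducing the time spent in each iteration" that a profiler will flag is loop-invariant work hidden in a loop condition. An illustrative C pair:

```c
#include <string.h>
#include <ctype.h>
#include <stddef.h>

/* Hidden O(n^2): strlen() re-scans the whole string on every iteration. */
size_t count_upper_slow(const char *s)
{
    size_t k = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (isupper((unsigned char)s[i])) k++;
    return k;
}

/* Hoisted: the length is computed once; the loop body is unchanged. */
size_t count_upper_fast(const char *s)
{
    size_t k = 0, n = strlen(s);
    for (size_t i = 0; i < n; i++)
        if (isupper((unsigned char)s[i])) k++;
    return k;
}
```

Compilers can sometimes hoist this themselves, but only when they can prove the string isn't modified inside the loop, which is exactly the kind of analysis a human does instantly.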
Re: (Score:3)
If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort.
Challenge accepted [wikipedia.org].
Re: (Score:2)
Funny! Well, since this sort compares its slowness to Bubble Sort, it would seem that Bubble Sort might still get 2nd place for slowest!
Re: (Score:2)
Bah - I like random sort, which is essentially:
swap two elements at random
check if the list is sorted now
repeat if the list is not sorted
Given enough time, it will sort the list, quite by accident. ;)
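A sketch of that random sort in C (illustrative names; termination is only probabilistic, so don't ship it):

```c
#include <stdlib.h>
#include <stdbool.h>
#include <stddef.h>

static bool is_sorted(const int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (a[i - 1] > a[i]) return false;
    return true;
}

/* Swap two elements at random until the array happens to be sorted. */
void random_sort(int *a, size_t n)
{
    while (!is_sorted(a, n)) {
        size_t i = (size_t)rand() % n;
        size_t j = (size_t)rand() % n;
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```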
Re: (Score:2)
If truly random, it's not guaranteed to ever finish.
Or if your random number generator is broken (worst case it always returns the same number), then it certainly won't finish.
Re:Every boomer programmer just shrugged (Score:5, Interesting)
If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort.
Then you lack imagination. Bubble sort is O(n^2). There are O(n^3) sorting algorithms. Here's an O(n!) sort:
1. Shuffle data randomly
2. Test if it is sorted. If yes, you're done, else go to 1.
And yet for years, every computer programming textbook taught this algorithm that's useful for basically nothing
Bubble sort is useful for very small datasets, like 10 or so, and constrained memory or cache capacity.
But bubble sort is mostly taught as an example of a naive implementation leading to poor performance.
isn't even intuitive. Students normally react with "How does that even work???"
It's obvious why Bubble sort works. It is way easier to understand than Quicksort.
Re:Every boomer programmer just shrugged (Score:5, Funny)
1. Shuffle data randomly
2. Test if it is sorted. If yes, you're done, else go to 1.
Look, I've told you before - I get really tired of people reposting my code without attribution.
Re: (Score:2)
Remember Bubble Sort? If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort. And yet for years, every computer programming textbook taught this algorithm that's useful for basically nothing, and isn't even intuitive. Students normally react with "How does that even work???" But you know that algorithm made its way into more than a few production systems.
Bubble Sort sorts an already sorted list in O(n) time. Try doing the same thing with Merge Sort.
Re: (Score:2)
Try using insertion sort in an array.
Re: (Score:2)
Ok
Re: (Score:2)
Shell sort and insertion sort are both simple and both do better than bubble sort, even in your "ideal" scenario.
Re: (Score:2)
Checking if your list is already sorted before fucking with it is usually a good idea.
Re: (Score:2)
I have not used linked lists in years. I typically use arrays which I sort with Radix Sort in CUDA.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Remember Bubble Sort? If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort
Bubble sort is faster than Quicksort for fewer than 8 items. That sounds like nothing, until you realize that many, if not most, of the sorts actually performed probably involve fewer than 8 items.
Re: (Score:2)
And shell sort and insertion sort would be even faster.
Re: (Score:2)
Re:Every boomer programmer just shrugged (Score:5, Informative)
And yet for years, every computer programming textbook taught this algorithm that's useful for basically nothing, and isn't even intuitive.
You didn't pay any attention in class. Bubble sort is held up in virtually every textbook as an example of something that does the baseline job in an inefficient way. It is literally taught as an example of not being useful.
That said, I think your assertion that it isn't intuitive is quite silly. It's probably the most intuitive algorithm there is. Is the current value bigger than the next value in the array? If so, switch them, repeat, done. There is literally nothing more intuitive than comparing two numbers and just moving the bigger one up the pile to sort something.
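That compare-and-swap recipe fits in a few lines of C. A minimal sketch, with the common early-exit tweak that makes a pass over already-sorted input O(n):

```c
#include <stdbool.h>
#include <stddef.h>

/* Bubble sort: compare neighbours, swap if out of order, and repeat
 * until a full pass makes no swaps. Large values "bubble" to the end.
 * The swapped flag is what gives O(n) behaviour on sorted input. */
void bubble_sort(int *a, size_t n)
{
    bool swapped = true;
    while (swapped) {
        swapped = false;
        for (size_t i = 1; i < n; i++) {
            if (a[i - 1] > a[i]) {
                int t = a[i - 1]; a[i - 1] = a[i]; a[i] = t;
                swapped = true;
            }
        }
    }
}
```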
Re: (Score:2)
Perhaps where and when you went to school, they taught bubble sort as an example of inefficient programming. But maybe the 1970s were before your time. In those days, it was presented simply as an example of how to sort.
Re: (Score:2)
Yeah, I've been thinking that if you assign an inexperienced programmer, like a student, to sort a list without any training, you're probably going to get something like bubblesort.
Re: (Score:2)
I learned bubble sort as a functional example of a while loop.
It is a super easy to understand program that does something useful and demonstrates some basic functions. We were also told it was pretty useless in real life.
I don't know why you think it's hard to understand, except maybe that it's pretty much the first thing taught. It was meant to be somewhat challenging yet still understandable at the point it was taught (the first week or so).
Re: Every boomer programmer just shrugged (Score:1)
Optimizing in assembly used to be a routine thing. I guess it went away because: 1. Takes forever. 2. Only very smart people can do it. 3. Idiot managers only want shipped product, today!
I think #2 is the real bottleneck. Geniuses have lots of options, working for idiots is a shitty option.
Re: Every boomer programmer just shrugged (Score:2)
I wonder how much faster it will be when it's written in rust?
Re: (Score:2)
Haha, I came here to comment almost the same thing. Can't tell if you are being sarcastic, though. But I sure was going to be.
The masturbating security monkeys sure seem to think Rust is hot shit but I want REAL WORLD examples dammit. If I was a billionaire I would be paying top programmers to battle head to head, and benchmark both the code, and TIME TO WRITE the code, of C vs Rust
Re: Every boomer programmer just shrugged (Score:2)
When does the competition end? You should be counting vulnerabilities in the initial product, and also over time as it gets updated with new features and existing bugs get fixed by developers who didn't have anything to do with the original code.
Re: (Score:2)
Does that include the few minutes it takes to write memory safe libraries in C to wrap the most common functions?
Re: (Score:2)
Sounds like you've fixed the problem already! There will be no more use-after-free, uninitialised memory, double-free, or buffer overflows ever again, all it takes is a few minutes to fix it all.
Re: (Score:2)
Yeah I know you're beowulfing a cluster of Natalie hot grits at this point but...
This could be a test case to improve the optimizations of gcc/LLVM
(1) Craft hand-written assembly ...
(2) Assemble.
(3) Decompile into high level language
(4) Compile
(5) Compare results of (2) and (4)
(6) Repeat
(7)
(8) Profit???
This is the point where you find there is an instruction that does what you are trying to do and your hand crafted assembly was dumb.
I've added a couple of instructions to X86, so for those two instructions, I don't make that mistake. All the others are fair game.
Re: Every boomer programmer just shrugged (Score:2)
Could very well be. Rust actually provides language-level control over SIMD if you want to be explicit about it.
https://doc.rust-lang.org/std/... [rust-lang.org]
And unlike assembly, it works across more than one architecture.
I've never tried to do this in C, but as far as I know, in C you can only rely on the compiler; it can't be done explicitly.
Re: (Score:3)
Well, maintenance nightmare aside, it's actually pretty hard to get performance out of handwritten assembly on x86. It's not worth it. I've seen the compiler spit out utter garbage that still runs at roughly the same speed as handwritten assembly, just because of the amazing pipelining the CPU does. The main benefit you'd get is a smaller binary file. The SIMD instructions are the outlier: there, compiler support might not be good enough, and the instructions are difficult for the compiler to generate.
Re: (Score:2)
The things you really want to look at are "number of memory accesses" or "number of missed branch predictions." Aligning cache accesses correctly can also be a big win.
Re: (Score:2)
You can only retire 1 op per cycle if each of your ops depends on the result of the previous op.
The real problem with compilers is that you get different performance depending on optimization settings, but not all optimizations are actually safe, and the benchmark will always be with the best set of all of that, but probably when you grab the source and use it in your o
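The dependency-chain point can be shown in plain C: splitting one accumulator into several independent ones gives the CPU separate add chains to overlap across execution ports. A sketch (for brevity it assumes n is a multiple of 4; names are illustrative):

```c
#include <stddef.h>

/* One accumulator: every add waits on the previous add, a serial
 * dependency chain, so the CPU's spare execution ports sit idle. */
float sum_serial(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four independent accumulators: four add chains proceed in parallel;
 * combine them once at the end. (Assumes n % 4 == 0.) */
float sum_ilp(const float *v, size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that compilers won't do this for floats by default, because it reorders the additions and floating-point addition is not associative; that's one of the "not all optimizations are safe" cases.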
Re: (Score:2)
Fused Multiply-Add is technically multiple operations executed in a single instruction.
But otherwise, agreed
Re: (Score:2)
Re: Every boomer programmer just shrugged (Score:2)
I think it's largely because compilers themselves have gotten pretty good at it, and the only way you can really improve things is with things like manual loop unrolling and tail recursion. But with modern processors you don't really gain much from that like you used to.
Here we have a case of some particular operations that are done more efficiently with SIMD (and probably even better with GPUs) than with traditional arithmetic operations and the compiler can't necessarily know how to best model around that
Re: (Score:2)
Mostly I have used assembler to do stuff needed in a system: stuff that a general-purpose library does for you on a full operating system. But in an embedded system you are the full operating system, and the RTOSes out there don't give you system startup code and the like. For instance, cache invalidation instructions, interrupt/exception handlers, memory barriers, context switching, etc. Other times you _know_ the code is very slow and can be sped up, but can't easily be sped up with pure standard C code.
In
Re: (Score:2)
What makes open source work is you can use all those libraries to put together incredibly useful and complex software but the downside is you can't just throw a few million dollars at paying some guy with a masters or a PhD in mathematics and computer science to do cool shit for you.
Getting these kind of optimizations done in an open source framework is super cool.
Re: (Score:2)
90% is a good start. But these guys got 9400%. Two orders of magnitude is the difference between getting a bruise and sending your head across a football field.
Yeah, I know, you meant 90x. Still, it's a pretty major improvement for a tool that lots of people use, unlike your code...
Re: (Score:2)
Wash, rinse, repeat.
I remember I inherited some code with all kinds of these hand-tuned SSE things, luckily guarded by ifdefs. Turned them off, and boom, my code was nearly 2X faster. Because SSE had gone the way of the dino, and the compiler was able to take the straight code and do better with newer/wider SIMD registers.
Hopefully AVX-512 will be around long enough to actually matter for them; however I have my doubts because I saw in other comments Intel is dropping support for it in some CPU families.
Re: (Score:2)
Doesn't this depend on chip model? (Score:5, Informative)
Re: (Score:3)
Yes. Zen 4 and especially Zen 5 will benefit greatly. Intel dropped AVX-512 support on their desktop CPUs with Alder Lake, so unless you're using a Xeon or something, you won't be able to use that codepath.
Well, that's one architecture. (Score:1)
Due to the nature of assembler, that code will be bound to one CPU architecture.
Won't help those of us on ARM or Apple silicon.
I suppose we could offload video over the network to an x86 architecture box, but that'll eat up some of that 94x.
Wonder what Windows and MacOS emulators make of raw machine code?
Re: (Score:2)
I have an acquaintance who works for one of the FAANG companies. The focus of their team is to hand-write assembly for performance-critical operations across the company. They do this for multiple chip architectures.
Your phone or PC might well have some of that code in it.
Re:Well, that's one architecture. (Score:4, Informative)
Re: (Score:2)
Apple doesn't yet support SVE2.
Re: (Score:1)
For video transcoding, you wouldn’t do it this way on Apple Silicon. You’d call the OS video library which would automatically invoke the video ASIC components of the CPU.
Re: (Score:2)
For video transcoding, you wouldn’t do it this way on Apple Silicon. You’d call the OS video library which would automatically invoke the video ASIC components of the CPU.
True, you would use the OS video library on an x86 platform, potentially incorporating the faster handwritten AVX-512 code.
If anyone can improve the speed of the Apple Silicon code libraries by using handwritten assembler code then they're free to submit it to Apple.
My guess is that there wouldn't be as much additional improvement in performance as was wrung out of the AVX-512 code.
Re: (Score:2)
Re: (Score:2)
IIRC ffmpeg does that at load time. When the library loads, it checks your processor's capabilities and selects the right implementation of these functions. So really it is done at the level of the symbol table, which gives you no performance hit for this type of thing.
Many BLAS libraries do things of that sort as well. (Intel MKL definitely does that.)
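A minimal sketch of that load-time dispatch idea in C, using GCC/Clang builtins. FFmpeg's real mechanism fills per-codec function-pointer tables from its own CPU-flag detection; the names here are illustrative, and the "fancy" variant is a stand-in for a SIMD implementation.

```c
#include <stddef.h>

/* Two interchangeable implementations of the same routine. */
static int sum_c(const int *v, size_t n)
{ int s = 0; for (size_t i = 0; i < n; i++) s += v[i]; return s; }

static int sum_fancy(const int *v, size_t n)  /* stand-in for a SIMD version */
{ int s = 0; for (size_t i = 0; i < n; i++) s += v[i]; return s; }

/* Function pointer filled in once at init, based on runtime CPU checks;
 * every later call pays only an indirect-call cost. */
static int (*sum_fn)(const int *, size_t);

void init_dsp(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) { sum_fn = sum_fancy; return; }
#endif
    sum_fn = sum_c;  /* portable fallback */
}
```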
Re: (Score:2)
AVX is a set of CPU instructions designed for operating on multiple values with one instruction. If you go anywhere near this sort of stuff, you're already tying yourself to a specific type of CPU.
This is the kind of optimization you do for stuff that's really CPU intense, like video codecs. This is where you really want efficient code, not portable code.
Re: (Score:2)
It's funny. People posting all these replies seem to think I wasn't disassembling 6502 games for my Atari 400 in the 1980s so I could bust the copy protection and move them from tape to disk.
I'm well aware of what assembler is, how it works, what it does, and the advantages it confers, as well as the complexity it unfolds that higher-level languages and acceleration APIs do a good job of hiding. I haven't taken a close look at instruction sets recently (though I used to know the Z-80 set by heart).
I'm just
Re: (Score:1)
You probably do, because Apple's choice is to take a crappy ARM core, a low-performance, low-power CPU that can't really do anything well except power efficiency, and then bolt a bunch of ASICs on top of it that handle very narrow tasks quickly, to compensate for the slowness of the actual CPU.
And Apple certainly has an ASIC bolted on for video processing. And there's almost certainly a library that optimizes for that ASIC. It's probably one of the first ones they determined as one that will be nec
ffmpeg is just... awesome. (Score:5, Interesting)
A while back, I was working at a place that did video production. They had these expensive, barely working video appliances whose license fees were just plain extortion, and the support was often, "buy our newer model, and we might fix that". I took the physical appliance, removed the disk with the vendor OS and set it aside in case it was needed, installed Linux, and used ffmpeg for everything that appliance did. It worked perfectly and did what we needed it to do, and we might as well use the Supermicro hardware the appliance was built on for something, since the studio paid multiple orders of magnitude more than what it cost.
Wish that place donated to the ffmpeg project because they saved insane amounts of money after stopping all transactions with the craptastic vendor.
Crappy summary. (Score:4, Informative)
Not surprised (Score:1)
There has to be a compiler issue here (Score:5, Interesting)
AVX-512 should not be able to give you a 94x speedup. What are they comparing against, -O1?
Using AVX-512 should take you from processing 32 bits at a time to processing 512 bits at a time. That should give you at least a 16x speedup.
Depending on the architecture, you can get up to three ports of the processor filled with instructions, but that only brings it to a 48x speedup. To get to a 94x speedup you'd need twice that. So not only would you fill all three ports, which the baseline couldn't do, but you'd also have to issue these instructions twice as often with some unrolling or something like that.
(Though you never actually get a 16x speedup on a vector port, because in practice most processors have to downclock to have enough power to feed the 512-bit vector units.)
That also assumes that the calculations are entirely instruction bound and not I/O bound (including memory bound).
I do this shit for a living. I highly doubt the 94x speedup of AVX-512 against a reasonable compiler with the appropriate compilation flags on a reasonable C implementation of the code. It is not that hard to write code that compiles decently well. Even adding a little bit of unrolling and compiler intrinsics in important loops can give you significant performance improvements over -O3.
Getting a 94x improvement over decent code using handwritten assembly should not be possible. The last time I saw results like that was during the Cell era, when compilers didn't know anything about the SPE except the ISA. I have to guess that the baseline here is something unreasonable like "gold standard compiled -O1".
Does someone have the paper that is being presented here?
(Though kudos for handwriting AVX-512 production code. That's not easy.)
Re: (Score:2)
I would imagine there's also improvements like better memory alignment to avoid cache misses. Maybe more inlining of code, avoiding function calls and branching.
They're probably doing broader optimizations than you're thinking.
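The alignment point, at least, is easy to make concrete. A sketch of a 64-byte-aligned allocator in C11; `alloc_frame` is a hypothetical helper, not an FFmpeg API:

```c
#include <stdlib.h>
#include <stdint.h>

/* 64-byte alignment matches both a typical cache line and a 512-bit
 * vector register, so aligned vector loads never straddle two lines. */
float *alloc_frame(size_t n_floats)
{
    /* C11 aligned_alloc requires the size to be a multiple of the alignment,
     * so round the byte count up to the next multiple of 64. */
    size_t bytes = ((n_floats * sizeof(float) + 63) / 64) * 64;
    return aligned_alloc(64, bytes);
}
```

Buffers allocated this way let a kernel use aligned loads/stores and keep each access within a single cache line.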
Re: (Score:2)
All of that is totally doable in regular C code. You don't need to handwrite assembly to get there. And pretty much all of it is generic optimization, so you would want it to be part of your baseline.
Re: (Score:3)
When to inline is not a solved problem, and we know it's not a solved problem because compilers never do the opposite: they never find expressions the programmer repeated many times and turn that expression into a function. Where would you even begin with trying to automate such a thing fruitfully? We don't know the science.
Re: (Score:2)
Re: (Score:1)
If each core can process 1 "video artefact" every minute, with 8 cores that's 8 artefacts per minute. If they can now produce 16 artefacts each per minute, that's 128 artefacts per minute. That's still just 16x quicker, it seems?
Re: (Score:2)
Re: (Score:2)
yeah the reported range was 3x-94x.
obviously the 94x was on some weird edge case that happens to hit the sweet spot of their optimization. it might even be some 16x16 video they created as a test input for their pipelining.
most people would just ignore it and focus on the 3x which is still good, but you do you.
Re: (Score:2)
Who'd a Thunk It? (Score:1)
Actually a 1.2x speedup (Score:3, Informative)
Re: (Score:2)
did you see the actual benchmark? Do you have a link? I couldn't find it.
Re: (Score:2)
Thank you Sir, this is why we read the comments.
Where AI code helpers really belong? (Score:2)
Re: (Score:2)
No, LLMs (the type of AI you're talking about) can't do things like that. They are mostly glorified autocomplete tools based on existing code in their training data. Ask them for anything complex and they fail miserably. There's a video of "Minecraft created in AI"... watch it. It's basically useless.
These kinds of things require insight and inference, which no AI is capable of and certainly not LLMs. An LLM is designed to say what you think it should say next. It's literally that simple. And that
If AVX-512 can do that... (Score:2)
imagine what a Beo... nah, pardon me
imagine what CUDA or OpenCL can do.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Needs ARM love too (Score:2)
Re: (Score:2)