
Video Encoding on Graviton in 2025
April 3, 2025
In 2022, we published a post describing the advantages of running video encoding workloads on AWS Graviton processors. Since that time, AWS launched Graviton4-powered C8g instances, which offer up to 30% better performance than Graviton3. On video encoding workloads, Graviton4 performs 12-15% better than Graviton3, depending on the encoder, as shown in the following chart. During the same time, AWS and other partners contributed to open source projects such as x265 and SVT-AV1, increasing their performance. The performance of the x265 library has improved by 16-34% since the 3.6 release. SVT-AV1 has also improved; in 2022, it was missing important optimizations and ran many times more slowly on Graviton than on other Amazon Elastic Compute Cloud (Amazon EC2) instance types.

Image 1: Percent improvement to frames-per-second of four different encoding workloads when comparing C8g (Graviton4) to C7g (Graviton3).
Software video encoders are more flexible than hardware-accelerated encoders, so they can be the optimal choice for many use cases, especially offline transcoding or when using custom filters. Hardware-accelerated encoders may be the right choice for workloads that require extremely low latency, such as live video transcoding. But if a software encoding solution is right for your workload, Graviton instances can provide the best price-performance for many workloads and competitive price-performance on others. For customers who use Spot Instances, enabling Graviton instances broadens the choice of instance types, making it more likely that you can run on the least expensive instances available.
Graviton also remains the most cost-effective processor for many video encoding workloads. Using the same benchmark as the 2022 blog post, updated to the latest versions of the software, Graviton4 achieved the best price-performance among the instance types tested. Graviton-powered instances took three of the top five places among the tested instances, demonstrating that even Graviton2, which was announced in 2019, can still be very compelling for many use cases.

Image 2: Price performance compared to C5 for each instance type on an FFmpeg workload which encodes a single 4k input into five different outputs of different sizes using x264 and preset “veryslow.”
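As a rough illustration of how a price-performance comparison like this can be computed, the sketch below converts throughput and hourly price into frames encoded per dollar and normalizes against a baseline. The function names and the numbers in the usage example are hypothetical placeholders, not the measurements or prices behind the chart.

```python
def frames_per_dollar(fps, price_per_hour):
    """Frames encoded per dollar at a sustained throughput and hourly price."""
    return fps * 3600.0 / price_per_hour


def relative_price_performance(results, baseline):
    """Normalize every entry against the baseline entry.

    results maps a label to (fps, price_per_hour). All values passed in
    are supplied by the caller; nothing here is measured data.
    """
    base = frames_per_dollar(*results[baseline])
    return {name: frames_per_dollar(*value) / base
            for name, value in results.items()}


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    example = {"baseline": (10.0, 1.0), "candidate": (12.0, 0.8)}
    for name, score in relative_price_performance(example, "baseline").items():
        print(f"{name}: {score:.2f}x")
```

A score above 1.0 means the instance encodes more frames per dollar than the baseline, whether that comes from higher throughput, a lower price, or both.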
Performance Improvements on x265
Contributions to x265, a widely used open source library for encoding H.265 video, improved performance across a range of encoding presets. Some of the most time-consuming functions in video encoders are common to many algorithms, including sum-absolute-difference (SAD), convolution, and discrete cosine transform (DCT). Making performance improvements in these functions has an outsized impact on the performance of the encoder as a whole. Improvements in these functions resulted in performance improvements on multiple Graviton generations, but some of the best improvements were seen on Graviton4, which supports the newest instructions, such as SVE2. These graphs show the percentage improvement of x265 version 4.1 over the release branch of 3.6. The improvement over version 3.5, the release version that was available in 2022, was even more substantial, with Graviton instances running 2.3 to 2.5 times faster.
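For readers unfamiliar with the kernels named above, here is a minimal scalar sketch of sum-absolute-difference over two equally sized pixel blocks. It is a plain reference model written for illustration, not code from x265; the optimized versions compute the same result many pixels at a time with SIMD instructions.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks.

    Each block is a list of rows, and each row is a list of integer
    pixel values.
    """
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
    return total


if __name__ == "__main__":
    current = [[1, 2], [3, 4]]
    reference = [[2, 2], [1, 7]]
    print(sad(current, reference))
```

Because an encoder evaluates a metric like this for a very large number of candidate blocks during motion search, even small savings per call add up across a whole encode.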
Tips to Maximize Performance on Graviton
To get the best encoding performance on Graviton, there are a few tips that you can follow. The most important is to build your encoder pipeline from the latest source that you can. There have been many recent contributions that improve performance on Graviton, especially in x265 and SVT-AV1. If you are using either of these encoders, test the performance when building from the tip of the main branch. Check back often as well, because work on these projects is still ongoing at the time of writing.
Another way to improve performance on Graviton is to compile with the newest compiler available for your operating system or, better still, to install an even newer compiler, such as Clang-17 or later. (The easiest way to do this is usually to install from your distribution’s package manager.) This can boost performance in newer libraries such as x265, which have many encoder kernels implemented with compiler intrinsics rather than directly in assembly.
For some libraries, especially x265, compiling with a recent version of Clang instead of GCC can provide a significant performance uplift. In a test on C8g running Ubuntu Noble 24.04, x265 built with the packaged version of Clang-18 performed 11% better on average across benchmarks than x265 built with the packaged version of GCC-13.

Image 4: Performance improvement on C8g of x265 when compiled with Clang-18 vs GCC-13 on Ubuntu Noble (24.04).
New Instruction Support Enables More Optimizations
In this section, we discuss in detail how new capabilities of Graviton3 and Graviton4 can improve performance. The NEON instruction set, supported by all Graviton processors, enables 128-bit single-instruction-multiple-data (SIMD) processing. Using these instructions, Graviton can process 16 pixel-channels per instruction when processing 8-bit video. These types of instructions are the foundation of achieving high performance in video encoding on CPUs. Graviton3 adds support for another set of instructions, Scalable Vector Extension (SVE). Graviton4 builds on this further, adding support for SVE2. While SVE doesn’t increase the total parallel throughput of Graviton, it does introduce a new and simpler way to write SIMD code, which often uses fewer instructions. It also adds instruction types that are not available in NEON and that enable more optimizations for video encoding.
For example, in x265 a group of functions called saoCuStatsE0 can make use of an SVE2 instruction on Graviton4 called histseg, which counts, for each byte of one vector, how many bytes of a second vector match it. This enables the function to build a small frequency table from values in the source data. To do this with NEON instructions, you must mask off each value and count each one separately, with a separate instruction.
One of the tasks the saoCuStatsE0 function performs is to count the total number of each of the 5 different edge types in the block being processed. To do this with NEON instructions, after using some simple math to compute the edge type, a mask is constructed with compare instructions for each of the 5 edge types. Then each of those 5 masks is summed with a widening pair-wise add instruction, sadalp, which is limited to half of the SIMD bandwidth on both Graviton3 and Graviton4. This works and helps contribute to a speed boost of 2.7x over the C implementation. However, with the availability of SVE2 in Graviton4, we can replace the 5 compare and the 5 pair-wise add instructions with a single histseg instruction and a standard NEON add instruction, which doesn’t have the bandwidth limitation of the widening pair-wise add. This contributes to the SVE2 implementation of this function reaching a speed boost of 3.2x over the C implementation.
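The difference between the two counting strategies can be sketched with a scalar model. The code below is an illustration only, not x265 code and not real SIMD: count_with_masks mirrors the compare-then-sum approach described for NEON, while histseg_model is a hypothetical stand-in for the per-segment matching that histseg performs. The function names, the key layout, and the 0xFF padding are assumptions made for this example.

```python
NUM_EDGE_TYPES = 5
LANES = 16  # bytes in one 128-bit vector segment


def count_with_masks(edge_types):
    """Compare-and-sum style: one mask per edge type, then add up each mask."""
    counts = []
    for edge_type in range(NUM_EDGE_TYPES):
        mask = [1 if e == edge_type else 0 for e in edge_types]  # compare step
        counts.append(sum(mask))  # stands in for the widening pair-wise add
    return counts


def histseg_model(keys, segment):
    """Scalar model: for each key lane, count the matching bytes in segment."""
    return [sum(1 for s in segment if s == key) for key in keys]


def count_with_histseg(edge_types):
    """Histogram style: one match-count per segment, then a plain add."""
    # Lanes 0-4 hold the edge type values. The remaining lanes hold 0xFF,
    # which never matches a valid edge type (an assumption of this sketch).
    keys = list(range(NUM_EDGE_TYPES)) + [0xFF] * (LANES - NUM_EDGE_TYPES)
    totals = [0] * NUM_EDGE_TYPES
    for start in range(0, len(edge_types), LANES):
        lane_counts = histseg_model(keys, edge_types[start:start + LANES])
        for edge_type in range(NUM_EDGE_TYPES):
            totals[edge_type] += lane_counts[edge_type]  # the add step
    return totals


if __name__ == "__main__":
    edges = [2, 2, 1, 3, 0, 4, 2, 2, 1, 3, 2, 2, 2, 0, 4, 2] * 2
    print(count_with_masks(edges))
    print(count_with_histseg(edges))
```

Both functions return the same five counts. The point of the model is that the second version needs one matching step per segment instead of five separate compares and five separate sums.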
This is just a single example of the benefits that the new instructions available on Graviton3 and Graviton4 provide. There are other such cases as well, and implementing them has helped Graviton3 and Graviton4 achieve performance benefits over Graviton2 beyond what is possible by increasing chip performance alone. For engineers working on achieving peak performance on Graviton, the AWS Graviton Technical Guide has information on writing optimized assembly and intrinsics.
HDR and 10-bit Video
The previous post noted that there was more work to do, especially with HDR content using more than 8 bits of color depth. 10-bit and 12-bit color depths require twice the SIMD compute compared to 8-bit, and in most cases require the optimized kernels to be implemented separately from the 8-bit versions. Despite this complexity, much work has since been accomplished to accelerate 10-bit encoding. Most of this work has focused on x265 and SVT-AV1, since these codecs are more commonly used to deliver HDR content. For x265, these graphs show the improvement to FPS scores for three Graviton generations, with C8g, C7g, and C6g seeing 12%, 8%, and 10% average improvement, respectively.
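The "twice the SIMD compute" figure follows from simple arithmetic on the 128-bit vector width. The sketch below assumes that samples deeper than 8 bits are stored in 16-bit containers, which is common practice but is an assumption of this example; the function names are hypothetical.

```python
VECTOR_BITS = 128  # NEON vector width


def samples_per_vector(bit_depth):
    """Samples that fit in one vector, assuming 8-bit samples use 8-bit
    containers and deeper samples use 16-bit containers."""
    container_bits = 8 if bit_depth <= 8 else 16
    return VECTOR_BITS // container_bits


def vector_ops_for_row(width, bit_depth):
    """Vector operations needed to touch every sample in one row once."""
    per_vector = samples_per_vector(bit_depth)
    return -(-width // per_vector)  # ceiling division


if __name__ == "__main__":
    for depth in (8, 10, 12):
        print(depth, samples_per_vector(depth), vector_ops_for_row(3840, depth))
```

Under that assumption, a 3840-sample row takes 240 vector operations at 8 bits and 480 at 10 or 12 bits, which is the doubling the paragraph describes.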
Benchmarking Method
In each benchmark represented here, enough instances of the encode are run in parallel to fully load all cores of the instance under test. This models a workload that optimizes for the lowest encode cost, where the total time to encode each video is less important. Except for the FFmpeg scaling benchmark, which was the same as the one used for the 2022 blog post, each benchmark for this blog post took a raw source video file in a y4m container and encoded it directly with the encoder: x264, x265, or SVT-AV1.
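A harness in the spirit of this method could look like the sketch below. It is not the script used for these benchmarks: the function names are hypothetical, the number of copies needed to saturate an instance has to be found by experiment, and the encoder command line is left to the caller.

```python
import subprocess
import sys
import time


def run_parallel(command, copies):
    """Start `copies` identical processes, wait for all of them, and
    return (exit codes, wall-clock seconds)."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen(command,
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
        for _ in range(copies)
    ]
    exit_codes = [proc.wait() for proc in procs]
    elapsed = time.perf_counter() - start
    return exit_codes, elapsed


def aggregate_fps(frames_per_encode, copies, elapsed_seconds):
    """Total frames encoded by all copies divided by wall-clock time."""
    return frames_per_encode * copies / elapsed_seconds


if __name__ == "__main__":
    # Harmless stand-in command so the sketch runs anywhere. Replace it
    # with a real encoder invocation on a y4m input to benchmark.
    codes, seconds = run_parallel([sys.executable, "-c", "pass"], copies=2)
    print(codes, f"{seconds:.3f}s")
```

Measuring aggregate frames per second across all copies, rather than the speed of a single encode, matches the goal of minimizing cost per encoded video.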
Conclusion
Customers looking to reduce video encoding costs or boost performance should consider Graviton3 and Graviton4 powered instances. Graviton4 powered C8g instances are especially well suited to video encoding, delivering 12% better performance than Graviton3 and 73% better than Graviton2 on x265 benchmarks. To get the best performance from Graviton, use the latest sources of your encoders and consider building with Clang instead of GCC. Contributions to these open source projects are continuing, so check for updates to the source packages often. For more information on migrating to Graviton, refer to the AWS Graviton Technical Guide.
