Towards Practical Real-Time Neural Video Compression

Zhaoyang Jia, Bin Li, Jiahao Li, Wenxuan Xie, Linfeng Qi, Houqiang Li, Yan Lu

TL;DR

This paper targets the practical deployment gap in neural video codecs (NVCs) by identifying operational complexity, particularly latent size $P_{size}$ and module count $P_{num}$, as the primary bottleneck to real-time coding. It introduces DCVC-RT, a real-time NVC that combines implicit temporal modeling, latents at a single low resolution (e.g., $1/8$), module-bank rate control, and model integerization to achieve high rate-distortion (RD) performance with low latency. Empirically, DCVC-RT codes 1080p video in real time on consumer and data-center GPUs (e.g., around 125 fps encoding and 113 fps decoding on an A100 GPU) with about a 21% BD-Rate reduction versus H.266/VTM, while running far faster than prior approaches. The work demonstrates a practical path toward deployable NVCs and lays groundwork for further hardware-aware optimizations and larger models to close remaining gaps in high-bitrate regimes.
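
To make the latent size term $P_{size}$ concrete, the following minimal sketch counts spatial positions for a 1080p frame at a single $1/8$ scale and compares that with a hypothetical progressive pyramid. The helper names and the pyramid scales ($1/2$, $1/4$, $1/8$) are assumptions for illustration and are not taken from the paper; channel counts are ignored.

```python
import math

def latent_hw(height, width, factor=8):
    """Spatial size of a latent at 1/factor resolution (rounded up, assuming padding)."""
    return math.ceil(height / factor), math.ceil(width / factor)

def spatial_positions(height, width, factors):
    """Total number of spatial positions summed over the listed downsampling factors."""
    total = 0
    for f in factors:
        h, w = latent_hw(height, width, f)
        total += h * w
    return total

H, W = 1080, 1920

# Single low-resolution latent: every module works at 1/8 scale.
single = spatial_positions(H, W, [8])

# Hypothetical progressive pyramid (assumed scales, not from the paper).
progressive = spatial_positions(H, W, [2, 4, 8])

print(latent_hw(H, W))          # (135, 240)
print(single, progressive)      # 32400 680400
print(progressive / single)     # 21.0
```

Under these assumptions the pyramid touches 21 times as many spatial positions as the single-scale design, which gives a rough sense of why feature-map size matters for memory I/O.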

Abstract

We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. In practice, the coding speed of NVCs depends on 1) computational costs, and 2) non-computational operational costs, such as memory I/O and the number of function calls. While most efficient NVCs prioritize reducing computational cost, we identify operational cost as the primary bottleneck to achieving higher coding speed. Leveraging this insight, we introduce a set of efficiency-driven design improvements focused on minimizing operational costs. Specifically, we employ implicit temporal modeling to eliminate complex explicit motion modules, and use single low-resolution latent representations rather than progressive downsampling. These innovations significantly accelerate NVC without sacrificing compression quality. Additionally, we implement model integerization for consistent cross-device coding and a module-bank-based rate control scheme to improve practical adaptability. Experiments show our proposed DCVC-RT achieves an impressive average encoding/decoding speed at 125.2/112.8 fps (frames per second) for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM. The code is available at https://github.com/microsoft/DCVC.
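
As a rough illustration of why integerization helps cross-device consistency, the sketch below shows that floating-point accumulation depends on the order of operations, while integer accumulation does not. This is a toy example, not the paper's integerization scheme; the fixed-point scale `SCALE` and the helper names are assumptions.

```python
from itertools import permutations

values = [0.1, 0.2, 0.3]

def float_sum(order):
    """Accumulate in the given order using floating-point arithmetic."""
    total = 0.0
    for v in order:
        total += v
    return total

# Different accumulation orders (as different devices or kernels might use)
# can produce different floating-point results.
float_results = {float_sum(p) for p in permutations(values)}

SCALE = 1 << 16  # hypothetical fixed-point scale

def to_fixed(v):
    """Quantize a float to a fixed-point integer (assumed scheme)."""
    return int(round(v * SCALE))

# Integer addition is exact, so every order gives the same result.
int_results = {sum(to_fixed(v) for v in p) for p in permutations(values)}

print(len(float_results) > 1)  # True: order-dependent
print(len(int_results) == 1)   # True: order-independent
```

An entropy decoder must reproduce the encoder's probability estimates exactly, so even a small order-dependent difference of this kind can corrupt decoding across platforms.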

Paper Structure

This paper contains 31 sections, 6 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Towards practical real-time neural video codecs (NVCs). Recent advanced NVCs have demonstrated either excellent rate-distortion performance, or improved versatility like integrated cross-device coding consistency or rate-control capabilities. In this paper, we further address the core obstacles of achieving real-time coding to close the last mile toward a practical NVC solution. Our DCVC-RT not only achieves state-of-the-art compression ratio but is also deployable on consumer devices for real-time video coding.
  • Figure 2: Paradigm shift. To enhance efficiency, we eliminate explicit motion-related modules and adopt implicit temporal modeling. We also propose learning latent representations at a single low resolution, replacing the traditional progressive downsampling approach. Additionally, DCVC-RT supports integerization for cross-device consistency and incorporates a module-bank-based rate-control mechanism.
  • Figure 3: Analysis of computational complexity $P_{comp}$ and operational complexity, including latent representation size $P_{size}$ and number of modules $P_{num}$. (a) Reducing channels results in a quadratic decrease in $P_{comp}$, yet inference time decreases almost linearly, indicating that computational cost is not the primary speed bottleneck. (b) We independently reduce one of $P_{comp}$, $P_{size}$ and $P_{num}$ to identify the main factors affecting time cost. Results show that $P_{size}$ is most critical at high computational complexity, while $P_{num}$ is more significant at low computational complexity.
  • Figure 4: Framework overview. DC Block, Q, AE and AD represent depth-wise convolution block, quantization, arithmetic encoder and arithmetic decoder, respectively. $F_{t-1}$ and $F^e_{t-1}$ are temporal contexts extracted from the previously decoded latent $f_{t-1}$. Frames are transformed into latents at 1/8 resolution using patch embedding (Dosovitskiy et al., 2021), and key modules such as the encoder, decoder, frame extractor, and reconstruction generation operate at this single scale for efficient feature learning. DCVC-RT eliminates explicit motion modeling, resulting in a streamlined design with drastically reduced operational complexity and real-time performance.
  • Figure 5: Analysis of different components. (a) Ablation study on learning latent representations at a single resolution. All models maintain equal computational complexity (MACs) for fairness. (b) Example of probability estimation of $z$. Using a module bank instead of a single factorized module achieves an average bit savings of 3.4%. (c) Cross-platform coding test. Encoding is performed on an NVIDIA A100 GPU and decoding on an RTX 2080Ti. Model integerization effectively eliminates coding inconsistencies across platforms.
  • ...and 8 more figures
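
The quadratic relation between channel count and $P_{comp}$ noted in Figure 3(a) can be checked with a short calculation. This is a minimal sketch for a dense convolution only; the function name, the channel counts (256 and 128) and the 1x1 kernel are assumptions for illustration, and depth-wise convolutions scale linearly in channels instead.

```python
def conv_macs(height, width, c_in, c_out, kernel=1):
    """Multiply-accumulate count of a dense conv layer with stride 1 and 'same' padding."""
    return height * width * c_in * c_out * kernel * kernel

# 1080p frame at 1/8 resolution: 135 x 240 spatial positions.
H, W = 135, 240

full = conv_macs(H, W, 256, 256)  # hypothetical channel width
half = conv_macs(H, W, 128, 128)  # same layer with half the channels

print(full // half)  # 4: halving channels quarters the MACs
```

If inference time only drops roughly in proportion to the channel count while MACs drop by a factor of four, the remaining time has to come from non-computational costs such as memory I/O and per-module call overhead, which is the caption's argument.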