Towards Practical Real-Time Neural Video Compression
Zhaoyang Jia, Bin Li, Jiahao Li, Wenxuan Xie, Linfeng Qi, Houqiang Li, Yan Lu
TL;DR
This paper targets the practical deployment gap in neural video codecs (NVCs) by identifying operational cost, particularly latent size $P_{size}$ and module count $P_{num}$, as the primary bottleneck to real-time coding. It introduces DCVC-RT, a real-time NVC that uses implicit temporal modeling, single low-resolution latents (e.g., at $1/8$ resolution), module-bank rate control, and model integerization to achieve high rate-distortion (RD) performance with low latency. Empirically, DCVC-RT delivers real-time 1080p coding on consumer and data-center GPUs (e.g., around 125 fps encode / 113 fps decode on an A100) with about a 21% BD-Rate reduction versus H.266/VTM, while running far faster than prior approaches. The work demonstrates a practical path toward deployable NVCs and lays groundwork for further hardware-aware optimizations and larger models to close remaining gaps in high-bitrate regimes.
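To give a rough feel for why latent size matters, the sketch below counts spatial positions per channel for a 1080p frame under two layouts: a hypothetical progressive design that keeps features at 1/2, 1/4 and 1/8 resolution, and a single latent at 1/8 resolution. This is a toy illustration, not the paper's definition of $P_{size}$; the function names and the choice of scales are assumptions, and channel counts are ignored entirely.

```python
# Toy illustration only: counts spatial positions per channel and ignores
# channel width, kernel sizes, and everything else a real codec would need.

def positions(width, height, factor):
    """Spatial positions per channel after downsampling by `factor`."""
    return (width // factor) * (height // factor)

def progressive_positions(width, height, factors=(2, 4, 8)):
    """Hypothetical progressive layout: features exist at every listed scale."""
    return sum(positions(width, height, f) for f in factors)

def single_scale_positions(width, height, factor=8):
    """Single low-resolution latent at one scale (1/8 by default)."""
    return positions(width, height, factor)

WIDTH, HEIGHT = 1920, 1080  # 1080p frame
progressive = progressive_positions(WIDTH, HEIGHT)
single = single_scale_positions(WIDTH, HEIGHT)
print(progressive, single)
```

Under these assumptions the progressive layout touches 680,400 positions per channel against 32,400 for the single 1/8 latent. A real single-scale design would likely use more channels, so the practical gap in memory traffic would be smaller than this count suggests.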
Abstract
We introduce a practical real-time neural video codec (NVC) designed to deliver a high compression ratio, low latency and broad versatility. In practice, the coding speed of NVCs depends on 1) computational costs, and 2) non-computational operational costs, such as memory I/O and the number of function calls. While most efficient NVCs prioritize reducing computational cost, we identify operational cost as the primary bottleneck to achieving higher coding speed. Leveraging this insight, we introduce a set of efficiency-driven design improvements focused on minimizing operational costs. Specifically, we employ implicit temporal modeling to eliminate complex explicit motion modules, and use single low-resolution latent representations rather than progressive downsampling. These innovations significantly accelerate NVC without sacrificing compression quality. Additionally, we implement model integerization for consistent cross-device coding and a module-bank-based rate control scheme to improve practical adaptability. Experiments show that our proposed DCVC-RT achieves an average encoding/decoding speed of 125.2/112.8 fps (frames per second) for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM. The code is available at https://github.com/microsoft/DCVC.
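The motivation for integerization can be illustrated with a small, self-contained sketch. Floating-point addition is not associative, so two devices that accumulate the same values in a different order may produce slightly different results, which is a plausible source of encoder/decoder mismatch. Integer arithmetic does not have this problem. The fixed-point scale and helper names below are assumptions chosen for illustration and are not taken from DCVC-RT.

```python
# Minimal sketch: order-dependent float accumulation vs. order-independent
# integer accumulation. SCALE is a hypothetical fixed-point scale.

SCALE = 1 << 8  # 8 fractional bits, chosen only for this example

def accumulate(values, start):
    """Add values one by one, in the given order."""
    acc = start
    for v in values:
        acc += v
    return acc

def to_fixed(value):
    """Quantize a float to a fixed-point integer."""
    return round(value * SCALE)

floats = [0.1, 0.2, 0.3]
float_forward = accumulate(floats, 0.0)
float_reverse = accumulate(reversed(floats), 0.0)

fixed = [to_fixed(v) for v in floats]
int_forward = accumulate(fixed, 0)
int_reverse = accumulate(reversed(fixed), 0)

print(float_forward == float_reverse)  # order changes the float result
print(int_forward == int_reverse)      # integer result is order-independent
```

The price of the integer path is quantization error introduced by `to_fixed`, so a real integerized model has to choose its scales carefully to keep compression quality.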
