Neural Video Compression with Context Modulation

Chuanbo Tang, Zhuoyuan Li, Yifan Bian, Li Li, Dong Liu

TL;DR

This work tackles the challenge of efficiently exploiting temporal redundancy in neural video compression by introducing context modulation, which combines an oriented temporal context derived from the reference frame with the propagated context. The approach has two steps: flow orientation, which mines the correlation between the reference frame and the prediction frame, and context compensation, which fuses the oriented and propagated contexts under a global-local synergy mechanism supervised by a decoupling loss. Together they yield a richer temporal context and reduce irrelevant information in the prediction chain. Empirically, the method achieves average bitrate savings of $22.7\%$ over H.266/VVC and $10.1\%$ over the previous SOTA DCVC-FM, while operating within a conditional coding framework and maintaining competitive complexity. These gains demonstrate the practical potential of refined temporal context modeling for neural video codecs, with future work aimed at learnable warps and more explicit motion priors to further enhance temporal alignment and compression efficiency.

Abstract

Efficient video coding is highly dependent on exploiting the temporal redundancy, which is usually achieved by extracting and leveraging the temporal context in the emerging conditional coding-based neural video codec (NVC). Although the latest NVC has achieved remarkable progress in improving the compression performance, the inherent temporal context propagation mechanism lacks the ability to sufficiently leverage the reference information, limiting further improvement. In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. Specifically, we first propose the flow orientation to mine the inter-correlation between the reference frame and prediction frame for generating the additional oriented temporal context. Moreover, we introduce the context compensation to leverage the oriented context to modulate the propagated temporal context generated from the propagated reference feature. Through the synergy mechanism and decoupling loss supervision, the irrelevant propagated information can be effectively eliminated to ensure better context modeling. Experimental results demonstrate that our codec achieves on average 22.7% bitrate reduction over the advanced traditional video codec H.266/VVC, and offers an average 10.1% bitrate saving over the previous state-of-the-art NVC DCVC-FM. The code is available at https://github.com/Austin4USTC/DCMVC.
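
The abstract describes contexts that are built from a reference frame (or reference feature) together with a flow field. As a generic illustration of that kind of flow-based alignment, the following is a minimal sketch of backward warping with bilinear interpolation. It is not taken from the DCMVC code: the function name `backward_warp`, the `(H, W, C)` layout, the pixel-unit flow convention, and the border clamping are all assumptions made for this example.

```python
import numpy as np

def backward_warp(frame, flow):
    """Bilinearly sample `frame` (H, W, C) at positions displaced by `flow` (H, W, 2).

    ASSUMPTION: flow[..., 0] is the horizontal (x) displacement and flow[..., 1]
    the vertical (y) displacement, both in pixels. Sampling positions that fall
    outside the frame are clamped to the border.
    """
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Sampling coordinates in the reference frame, clamped to the valid range.
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer neighbours around each sampling position.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets, broadcast over the channel axis.
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]

    # Interpolate horizontally on the two rows, then vertically between them.
    top = (1 - wx) * frame[y0, x0] + wx * frame[y0, x1]
    bottom = (1 - wx) * frame[y1, x0] + wx * frame[y1, x1]
    return (1 - wy) * top + wy * bottom
```

With an all-zero flow the function returns the input unchanged, and a constant flow of one pixel in x shifts the content left by one column, repeating the last column at the border. A learned codec would typically apply the same idea to feature maps and follow it with convolutional refinement.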

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison of context generation between our DCMVC (Deep Context Modulation for Video Compression) and the previous state-of-the-art compression scheme DCVC-DC.
  • Figure 2: Our DCMVC framework. Based on the contextual-coding framework, we propose the context modulation to generate the compensated temporal context $\overline{C}_t^0$ with the input of propagated temporal context $C_t^0$, decoded flow $\hat{v}_{t}$, and reference frame $\hat{x}_{t-1}$.
  • Figure 3: (a) Framework overview of the flow orientation in context modulation. $\check{C}_t^0$ is the oriented context generated from reference frame $\hat{x}_{t-1}$ and oriented flow $\check{v}_{t}$. (b) Framework overview of the context compensation in context modulation. ${C}_t^0$ is the context generated through propagated reference feature $F_{t-1}$ and decoded flow $\hat{v}_{t}$. $\check{G}_t^0$ and ${G}_t^0$ are global features extracted from $\check{C}_t^0$ and ${C}_t^0$, respectively, and ${Cor}_G$ is the cosine correlation between $\check{G}_t^0$ and ${G}_t^0$. Similarly, $\check{L}_t^0$ and ${L}_t^0$ are local features extracted from $\check{C}_t^0$ and ${C}_t^0$, respectively, and ${Cor}_L$ is the cosine correlation between $\check{L}_t^0$ and ${L}_t^0$. The decoupling loss function ${L}_{decouple}$ consists of the cosine similarity of both global features and local features. $\overline{C}_t^0$ is the context generated from our proposed context compensation.
  • Figure 4: (a) Visualization of the estimated flow $v_t$ (estimated from reference frame and current frame), decoded flow $\hat{v}_{t}$ (decoded from motion decoder), oriented flow $\check{v}_t$ (generated from reference frame and prediction frame), and their warp frames. (b) Visualization of oriented context $\check{C}_t^0$, propagated context ${C}_t^0$, and compensated context $\overline{C}_t^0$ obtained from the context compensation.
  • Figure 5: Rate and distortion curve for UVG, MCL-JCV, and HEVC Class C datasets. The comparison is in RGB colorspace measured with PSNR, and the intra-period is set as 32.
  • ...and 3 more figures
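
The Figure 3 caption states that the decoupling loss is built from cosine similarities between global features and between local features of the oriented and propagated contexts. The sketch below only illustrates how two such correlation terms could be computed. The feature extractors are assumptions (spatial averaging for global features, per-position channel vectors for local features), the names `cosine_similarity` and `decoupling_terms` are hypothetical, and the way the paper signs, weights, and combines the terms into ${L}_{decouple}$ is not given in this summary, so it is left out.

```python
import numpy as np

def cosine_similarity(a, b, axis=-1, eps=1e-8):
    """Cosine similarity between `a` and `b` along `axis`."""
    dot = np.sum(a * b, axis=axis)
    norm = np.linalg.norm(a, axis=axis) * np.linalg.norm(b, axis=axis)
    return dot / (norm + eps)  # eps guards against division by zero

def decoupling_terms(oriented_ctx, propagated_ctx):
    """Return (cor_global, cor_local) for two contexts of shape (C, H, W).

    ASSUMPTION: global features are channel-wise spatial averages and local
    features are the per-position channel vectors. The extractors used in
    the paper may differ.
    """
    # Global features: one value per channel, shape (C,).
    g_o = oriented_ctx.mean(axis=(1, 2))
    g_p = propagated_ctx.mean(axis=(1, 2))
    cor_global = cosine_similarity(g_o, g_p)

    # Local features: one channel vector per spatial position, shape (H*W, C).
    c = oriented_ctx.shape[0]
    l_o = oriented_ctx.reshape(c, -1).T
    l_p = propagated_ctx.reshape(c, -1).T
    cor_local = cosine_similarity(l_o, l_p).mean()

    return float(cor_global), float(cor_local)
```

Identical contexts give both terms close to 1, and contexts whose energy lies in disjoint channels give both terms close to 0. A training loss would combine the two values according to the paper's definition.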