Multi-Reference Generative Face Video Compression with Contrastive Learning

Goluck Konuko, Giuseppe Valenzise

TL;DR

This paper tackles reconstruction drift in generative face video compression by introducing MRDAC, a Multi-Reference Deep Animation Codec that leverages multiple references and a contrastive learning objective to align reference-derived features. By predicting dense motion and occlusion for each reference and aggregating with temporal-distance weighting, MRDAC enhances motion prediction and enables both longer sequences with fewer references and improved accuracy at comparable bitrates. The approach yields superior rate-distortion and perceptual metrics, with notable gains in bi-directional prediction and robustness to large pose/expression changes, suggesting practical improvements for low-latency video conferencing. The work also analyzes reference-buffering strategies and discusses latency–accuracy trade-offs, indicating MRDAC’s potential for integration with hybrid residual schemes in GFVC pipelines.
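The temporal-distance weighting mentioned above can be illustrated with a minimal sketch. The summary does not give the exact weighting function, so the softmax over negative frame distance used here, the temperature `tau`, and the helper names `temporal_weights` and `aggregate` are all assumptions made for illustration, not the authors' implementation.

```python
import math

def temporal_weights(ref_indices, target_index, tau=1.0):
    """Assumed weighting: softmax over negative temporal distance,
    so references closer in time to the target get larger weights."""
    scores = [-abs(target_index - r) / tau for r in ref_indices]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(features, weights):
    """Weighted sum of per-reference feature vectors (lists of floats)."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Hypothetical example: references at frames 0 and 10, target at frame 2.
weights = temporal_weights([0, 10], target_index=2, tau=4.0)
fused = aggregate([[1.0, 0.0], [0.0, 1.0]], weights)
```

In this sketch the reference at frame 0 dominates the fused feature because it is closer to the target. In a bi-directional setting, where the target lies between a past and a future reference, the weights shift smoothly from one reference to the other.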

Abstract

Generative face video coding (GFVC) has been demonstrated as a potential approach to low-latency, low-bitrate video conferencing. GFVC frameworks achieve an extreme gain in coding efficiency, with over 70% bitrate savings compared to conventional codecs at bitrates below 10 kbps. In recent MPEG/JVET standardization efforts, all the information required to reconstruct video sequences using GFVC frameworks has been adopted as part of the supplemental enhancement information (SEI) in existing compression pipelines. In light of this development, we aim to address a challenge that has been only weakly addressed in prior GFVC frameworks, i.e., reconstruction drift as the distance between the reference and target frames increases. This drift creates the need to update the reference buffer more frequently by transmitting more intra-refresh frames, which are the most expensive element of the GFVC bitstream. To overcome this problem, we instead propose multi-reference animation as a robust approach to minimizing reconstruction drift, especially when used in a bi-directional prediction mode. Further, we propose a contrastive learning formulation for multi-reference animation. We observe that using a contrastive learning framework enhances the representation capabilities of the animation generator. The resulting framework, MRDAC (Multi-Reference Deep Animation Codec), can therefore be used to compress longer sequences with fewer reference frames or to achieve a significant gain in reconstruction accuracy at bitrates comparable to previous frameworks. Quantitative and qualitative results show significant coding and reconstruction quality gains compared to previous GFVC methods, and more accurate animation in the presence of large pose and facial expression changes.
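As a rough illustration of what a contrastive objective over reference-derived features could look like, the sketch below uses an InfoNCE-style loss. The abstract does not specify the loss, so the InfoNCE form, the cosine similarity, the `temperature` value, and the function names are assumptions; the only idea taken from the text is that features derived from different references for the same target should agree.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Assumed InfoNCE-style loss: low when `anchor` is closer to
    `positive` than to any of the `negatives`."""
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - pos

# Hypothetical features: two references for the same target frame (a positive
# pair) and one feature derived for a different target (a negative).
feat_ref1 = [0.9, 0.1]
feat_ref2 = [0.8, 0.2]
feat_other_target = [0.1, 0.9]
loss = info_nce(feat_ref1, feat_ref2, [feat_other_target])
```

Minimizing such a loss pulls together the features predicted from different references for the same target while pushing them away from features of other targets, which is one common way to "maximize agreement" between representations.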

Paper Structure (16 sections, 3 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Proposed multi-reference deep animation codec (MRDAC): For a target video sequence, a group of keyframes ($X^{r}_{1},\ldots,X^{r}_{N}$) is used to predict an aggregate feature representation from which a target frame is predicted. The motion information between each reference frame and the target is predicted independently. We propose a loss function that maximizes agreement between the feature representations $\varepsilon_i$ derived from each reference frame.
  • Figure 2: Average RD Performance: Within the standardized GFVC setting, a long sequence is animated from a single reference frame, leading to ultra-low bitrate compression. Our framework progressively introduces new reference frames and jointly uses them to animate the subsequent frames, leading to a significantly higher average reconstruction performance.
  • Figure 3: Visual illustration of improved reconstruction accuracy with our proposed coding framework. Accumulating reference frames and using them to jointly animate the target frames leads to higher accuracy in pixel-level details and higher perceptual fidelity.
  • Figure 4: Bi-Directional prediction with Multi-Reference Animation
  • Figure 5: Evaluating the effectiveness of coding delay in multi-reference animation: Using only the past reference frames (RRB) provides incremental gains but does not effectively capture future poses and motion characteristics. Using reference pre-selection (RP) provides a better motion model and hence higher accuracy. However, using a combination of the two strategies achieves the highest accuracy.