Audio-Visual Cross-Modal Compression for Generative Face Video Coding

Youmin Xu, Mengxi Guo, Shijie Zhao, Weiqi Li, Junlin Li, Li Zhang, Jian Zhang

TL;DR

Experiments show that AVCC significantly outperforms the Versatile Video Coding standard and state-of-the-art GFVC schemes in rate-distortion performance, paving the way for more efficient multimodal communication systems.

Abstract

Generative face video coding (GFVC) is vital for modern applications like video conferencing, yet existing methods primarily focus on video motion while neglecting the significant bitrate contribution of audio. Despite the well-established correlation between audio and lip movements, this cross-modal coherence has not been systematically exploited for compression. To address this, we propose an Audio-Visual Cross-Modal Compression (AVCC) framework that jointly compresses audio and video streams. Our framework extracts motion information from video and tokenizes audio features, then aligns them through a unified audio-video diffusion process. This allows synchronized reconstruction of both modalities from a shared representation. In extremely low-rate scenarios, AVCC can even reconstruct one modality from the other. Experiments show that AVCC significantly outperforms the Versatile Video Coding (VVC) standard and state-of-the-art GFVC schemes in rate-distortion performance, paving the way for more efficient multimodal communication systems.

Paper Structure

This paper contains 16 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison between (a) existing generative face video coding, which processes audio and video independently, and (b) our proposed AVCC framework. AVCC constructs a mutual representation from audio and video temporal features to assist the joint encoding and decoding of both streams.
  • Figure 2: Overall framework of our proposed AVCC method. The encoder jointly compresses audio and video inputs using cross-modal information. The decoder uses a unified diffusion process to reconstruct synchronized audio and motion vectors, which drive the generation of video frames from a transmitted key frame.
  • Figure 3: Rate-Distortion (RD) performance comparison. AVCC is benchmarked against VVC and state-of-the-art GFVC methods using three different perceptual metrics: (a) 1-DISTS, (b) 1-LPIPS, and (c) 200-FID. Our method consistently achieves superior reconstruction quality (higher scores) at similar bitrates across all metrics.
  • Figure 4: Visual quality comparisons. At an ultra-low bitrate of 2 kbps, our AVCC method preserves facial details and expression more effectively than baseline methods, resulting in better perceptual quality (lower DISTS score).
  • Figure 5: 3D Rate-Distortion (RD) surface relating bitrate (X), Video-PSNR (Y), and Audio-SDR (Z). The AVCC surface's contortion indicates a strong correlation between audio and video quality. The contour plots (right) confirm this: CTTR shows no correlation, while AVCC shows a clear positive relationship, especially at low bitrates (red region).
  • ...and 1 more figures
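
Figure 5 plots Video-PSNR against Audio-SDR. The page does not give the exact definitions the authors use, so the sketch below assumes the standard textbook forms: PSNR as 10·log10(peak² / MSE) over pixel values, and SDR as 10·log10(signal energy / error energy) over audio samples. The function names `psnr` and `sdr` and the flat-list inputs are illustrative choices, not the authors' evaluation code.

```python
import math


def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumes the textbook definition 10 * log10(peak^2 / MSE); `peak` is the
    maximum possible pixel value (255 for 8-bit video).
    """
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: distortion-free
    return 10.0 * math.log10(peak ** 2 / mse)


def sdr(reference, estimate):
    """Signal-to-distortion ratio in dB between two equal-length audio sequences.

    Assumes the plain energy-ratio definition 10 * log10(||s||^2 / ||s - s_hat||^2),
    without the scale-invariant or projection-based variants.
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    if signal_energy == 0:
        return -math.inf  # silent reference with a non-silent estimate
    return 10.0 * math.log10(signal_energy / error_energy)
```

Higher is better for both quantities. A codec whose audio and video quality rise and fall together would trace the kind of positively correlated surface the Figure 5 caption describes.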