Pleno-Generation: A Scalable Generative Face Video Compression Framework with Bandwidth Intelligence

Bolin Chen, Hanwei Zhu, Shanzhi Yin, Lingyu Zhu, Jie Chen, Ru-Ling Liao, Shiqi Wang, Yan Ye

TL;DR

This work tackles the challenge of achieving wide bitrate coverage in generative face video compression (GFVC). It introduces Pleno-Generation (PGen), a scalable representation and layered reconstruction (SRLR) framework that combines a GFVC-compatible base layer with an enhancement layer carrying multi-granularity signals, and uses attention-guided generation to enable bandwidth-aware, high-fidelity reconstruction. Key contributions include multi-granularity feature representation, entropy-based feature compression, and an attention-guided signal generator within the scalable SRLR architecture, along with extensive experiments showing rate-distortion (RD) and perceptual gains over VVC and existing GFVC baselines. The results indicate that PGen extends the usable bitrate range, improves perceptual quality, and acts as a plug-and-play enhancement for existing GFVC algorithms, with practical implications for bandwidth-efficient, high-fidelity face video communication.

Abstract

Generative-model-based compact video compression typically operates within a relatively narrow range of bitrates, often with an emphasis on ultra-low-rate applications. There is an increasing consensus in the video communication industry that generative coding should enable full bitrate coverage. However, this is an extremely difficult task, largely because generation and compression, although related, have distinct goals and trade-offs. The proposed Pleno-Generation (PGen) framework distinguishes itself by improving the robustness of video coding: through bandwidth intelligence, it can exploit a wider range of bandwidth for generation. In particular, we begin our research on PGen with face video coding, where PGen offers a paradigm shift that prioritizes high-fidelity reconstruction over the pursuit of a compact bitstream. The PGen framework leverages scalable representation and layered reconstruction for Generative Face Video Compression (GFVC), in an attempt to imbue the bitstream with intelligence at different granularities. Experimental results show that the proposed PGen framework can help existing GFVC algorithms deliver higher-fidelity and more faithful face videos. In addition, the proposed framework allows greater flexibility for coding applications and shows superior rate-distortion (RD) performance over a much wider bitrate range across various quality measures. Moreover, in comparison with the latest Versatile Video Coding (VVC) codec, the proposed scheme achieves competitive Bjøntegaard-delta-rate savings for perceptual-level evaluations.
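
For readers unfamiliar with the Bjøntegaard-delta-rate (BD-rate) figure quoted above, the following is a minimal sketch of the classic cubic-fit variant of the metric. It is not taken from the paper, whose exact evaluation procedure is not given in this summary. The function name `bd_rate`, its parameters, and the sample numbers are hypothetical and chosen for illustration only. The sketch fits log-bitrate as a cubic polynomial of the quality score for each codec, averages the gap between the two fits over the quality interval they share, and converts that gap to a percentage.

```python
import numpy as np


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Average bitrate difference (%) of a test codec relative to an anchor.

    Negative values mean the test codec needs less bitrate for the same
    quality. Each curve needs at least four rate/quality points.
    (Hypothetical helper; names and interface are assumptions.)
    """
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))
    qa = np.asarray(quality_anchor, dtype=float)
    qt = np.asarray(quality_test, dtype=float)

    # Fit log-rate as a cubic polynomial of quality for each codec.
    poly_a = np.polyfit(qa, log_ra, 3)
    poly_t = np.polyfit(qt, log_rt, 3)

    # Integrate only over the quality interval covered by both curves.
    lo = max(qa.min(), qt.min())
    hi = min(qa.max(), qt.max())
    if hi <= lo:
        raise ValueError("the two curves share no quality interval")

    int_a = np.polyint(poly_a)
    int_t = np.polyint(poly_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)

    # Mean log-rate gap converted to a relative bitrate change in percent.
    return float((np.exp(avg_t - avg_a) - 1.0) * 100.0)


# Illustrative numbers only (not results from the paper): the test codec
# reaches the same quality scores at half the anchor's bitrate.
anchor_rate = [100.0, 200.0, 400.0, 800.0]
quality = [30.0, 33.0, 36.0, 39.0]
test_rate = [r / 2.0 for r in anchor_rate]
savings = bd_rate(anchor_rate, quality, test_rate, quality)
```

With these made-up inputs the result is close to -50, meaning a 50% bitrate saving at equal quality. The sketch only requires the two curves to share a quality interval, so it works with any scalar quality score, including perceptual measures. Other BD-rate variants replace the cubic fit with piecewise interpolation and can give slightly different numbers.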

Paper Structure

This paper contains 25 sections, 12 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: The general flowchart and different facial representations for GFVC [chen2023generative].
  • Figure 2: Comparisons and analyses of rate-distortion limitations for VVC codec and latest GFVC systems in terms of perceptual-level and pixel-level measures.
  • Figure 3: Limitations of visual reconstruction quality for GFVC systems, where these generated faces are sourced from FOMM, CFTE and FV2V.
  • Figure 4: Overview of the proposed Pleno-Generation (PGen) framework based on scalable representation and layered reconstruction for bandwidth-intelligent face video communication.
  • Figure 5: The detailed enhancement-layer architecture to achieve multi-granularity facial feature description, entropy-based feature compression and attention-guided signal generation (i.e., attention-guided face signal enhancement and coarse-to-fine frame generation). In particular, Q, AE, AD and Conv denote the quantization, arithmetic encoding, arithmetic decoding and convolution processes, respectively.
  • ...and 8 more figures