Resolution-Agnostic Neural Compression for High-Fidelity Portrait Video Conferencing via Implicit Radiance Fields

Yifei Li, Xiaohong Liu, Yicong Peng, Guangtao Zhai, Jun Zhou

TL;DR

This work tackles ultra-low-bandwidth, high-fidelity portrait video conferencing by replacing pixel-based codecs with a NeRF-based framework built on implicit radiance fields. Instead of transmitting frames, the sender transmits facial expression and head pose features (frame substitution): $\delta \in \mathbb{R}^{79}$ and $p \in \mathbb{R}^{12}$ are extracted from 3D Morphable Models, then refined with an attention-based fine-tuning embedding and entropy coded. At the receiver, a feature-conditioned dynamic NeRF with separate head and torso fields reconstructs the portrait via volume rendering, and a learnable head-torso consistency constraint improves fidelity. The implicit field is denoted $\mathcal{N}_\Theta$; it predicts color $\mathbf{c}$ and density $\sigma$ along each ray, and a pixel is rendered as $\mathcal{C}(\mathbf{r};\Theta,P,\delta)=\int \sigma_\Theta(\mathbf{r}(t))\cdot \mathbf{c}_\Theta(\mathbf{r}(t),\mathbf{d})\cdot T(t)\,dt$, where $T(t)=\exp(-\int \sigma_\Theta(\mathbf{r}(x))dx)$ is the accumulated transmittance. Experiments on the HDTF dataset show resolution-agnostic, ultra-low-bandwidth performance that outperforms HEVC and prior neural compression baselines in both objective metrics (e.g., CSIM, AUCON, PRMSE) and subjective MOS evaluations, which matters in practice for real-time video conferencing and high-resolution reconstruction.
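
The rendering integral above has no closed form for a learned field, so NeRF-style renderers usually evaluate it by quadrature over samples along each ray. The following is a minimal sketch of that standard discretization in plain Python, not the authors' implementation. The function name `render_ray`, the bin-edge list `t_vals`, and the toy densities and colors are assumptions made for illustration; in the actual system the densities and colors would come from the feature-conditioned field $\mathcal{N}_\Theta$.

```python
import math

def render_ray(sigmas, colors, t_vals):
    """Approximate the volume rendering integral along one ray.

    sigmas: density at each of n samples (hypothetical field outputs)
    colors: RGB triple at each of n samples
    t_vals: n + 1 increasing bin edges along the ray
    Returns the rendered RGB pixel and the per-sample weights.
    """
    transmittance = 1.0            # T(t): fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for i in range(len(sigmas)):
        dt = t_vals[i + 1] - t_vals[i]
        # Opacity contributed by this interval of the ray.
        alpha = 1.0 - math.exp(-sigmas[i] * dt)
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(3):
            pixel[k] += weight * colors[i][k]
        # Light surviving past this interval.
        transmittance *= 1.0 - alpha
    return pixel, weights

# Toy ray: empty space, then an effectively opaque green sample,
# then a blue sample that is fully occluded.
demo_pixel, demo_weights = render_ray(
    sigmas=[0.0, 1e6, 3.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    t_vals=[0.0, 1.0, 2.0, 3.0],
)
print(demo_pixel, demo_weights)
```

In the toy ray, the first sample has zero density and contributes nothing, the second absorbs all remaining light, and the third is hidden behind it, so the rendered pixel takes the color of the second sample.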

Abstract

Video conferencing has attracted much more attention recently. High fidelity and low bandwidth are the two major objectives of video compression for video conferencing applications. Most pioneering methods rely on classic video compression codecs without high-level feature embedding and thus cannot reach extremely low bandwidth. Recent works instead employ model-based neural compression to achieve ultra-low bitrates using sparse representations of each frame, such as facial landmark information, but these approaches cannot maintain high fidelity because they rely on 2D image-based warping. In this paper, we propose a novel low-bandwidth neural compression approach for high-fidelity portrait video conferencing that uses implicit radiance fields to achieve both objectives. We leverage dynamic neural radiance fields to reconstruct a high-fidelity talking head from expression features, which are transmitted in place of the frame (frame substitution). The overall system employs a deep model to encode expression features at the sender and reconstructs the portrait at the receiver, with volume rendering as the decoder, for ultra-low bandwidth. In particular, because the model is based on neural radiance fields, our compression approach is resolution-agnostic: the low bandwidth it achieves is independent of video resolution, while fidelity is maintained for higher-resolution reconstruction. Experimental results demonstrate that our framework can (1) enable ultra-low-bandwidth video conferencing, (2) maintain high-fidelity portraits, and (3) perform better on high-resolution video compression than previous works.
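
The resolution-agnostic claim can be made concrete with a back-of-the-envelope calculation. The sketch below compares the raw size of the per-frame features named in the TL;DR (79 expression values and 12 pose values) with raw pixel data at two resolutions. The 32-bit value width, the 25 fps frame rate, the 24 bits per pixel, and both function names are assumptions chosen for illustration. They are not figures reported by the paper, and the feature number is an upper bound before the fine-tuning embedding and entropy coding are applied.

```python
def raw_feature_bitrate_kbps(expr_dim=79, pose_dim=12, bits_per_value=32, fps=25):
    """Uncompressed feature bitrate in kbps; has no resolution argument at all."""
    return (expr_dim + pose_dim) * bits_per_value * fps / 1000.0

def raw_pixel_bitrate_kbps(width, height, bits_per_pixel=24, fps=25):
    """Uncompressed RGB video bitrate in kbps; grows with the pixel count."""
    return width * height * bits_per_pixel * fps / 1000.0

feature_kbps = raw_feature_bitrate_kbps()
for side in (256, 1024):
    pixel_kbps = raw_pixel_bitrate_kbps(side, side)
    print(f"{side}x{side}: features {feature_kbps} kbps, raw pixels {pixel_kbps} kbps")
```

Under these assumptions the feature stream stays at 72.8 kbps whether the receiver renders at $256\times 256$ or $1024\times 1024$, while the raw pixel stream grows sixteenfold between those two resolutions.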

Paper Structure (28 sections, 5 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Illustration of our NeRF-based video compression. The core idea of our framework is frame-feature substitution for extremely low bandwidth. With a NeRF-based face reconstruction model ensuring high-fidelity portrait generation, our framework achieves significant compression performance for video conferencing applications.
  • Figure 2: The overall framework of our proposed method. Face features are extracted at the sender and transmitted in place of the frame at ultra-low bandwidth. At the receiver, the NeRF-based model takes the received features as input to reconstruct the portrait frame.
  • Figure 3: Training scheme of the NeRF-based reconstruction model. We leverage a consistency constraint code to obtain better generative results.
  • Figure 4: Qualitative results of the proposed framework compared with previous model-based compression methods (FOMM and Bi-layer) and a classic video codec (HEVC). Our approach, which employs a NeRF-based model for high-fidelity reconstruction and feature-frame substitution for ultra-low bandwidth, significantly outperforms the other methods in image quality. $f.t.$ denotes the fine-tuning embedding employed in the framework.
  • Figure 5: Rate-distortion curves for our proposed framework compared with an existing model-based compression method and the classic codec HEVC. The resolution for the HEVC codec is $256\times 256$.
  • ...and 2 more figures