Resolution-Agnostic Neural Compression for High-Fidelity Portrait Video Conferencing via Implicit Radiance Fields
Yifei Li, Xiaohong Liu, Yicong Peng, Guangtao Zhai, Jun Zhou
TL;DR
This work targets ultra-low-bandwidth, high-fidelity portrait video conferencing by replacing pixel-based codecs with a NeRF-based framework built on implicit radiance fields. At the sender, each frame is substituted by its facial expression and head pose features, $\delta \in \mathbb{R}^{79}$ and $p \in \mathbb{R}^{12}$ respectively, which are extracted from 3D Morphable Models, refined with an attention-based fine-tuning embedding, and entropy coded. At the receiver, a feature-conditioned dynamic NeRF $\mathcal{N}_\Theta$, comprising separate head and torso fields, reconstructs the portrait via volume rendering, with a learnable head-torso consistency constraint to improve fidelity. The field predicts color $\mathbf{c}$ and density $\sigma$ along each ray $\mathbf{r}$ with view direction $\mathbf{d}$, and the rendered color is $\mathcal{C}(\mathbf{r};\Theta,P,\delta)=\int_{t_n}^{t_f} \sigma_\Theta(\mathbf{r}(t))\cdot \mathbf{c}_\Theta(\mathbf{r}(t),\mathbf{d})\cdot T(t)\,dt$, where $T(t)=\exp\left(-\int_{t_n}^{t} \sigma_\Theta(\mathbf{r}(x))\,dx\right)$ is the accumulated transmittance and $t_n$, $t_f$ are the near and far bounds of the ray. Experiments on the HDTF dataset show resolution-agnostic, ultra-low-bandwidth performance that outperforms HEVC and prior neural compression baselines on objective metrics (e.g., CSIM, AUCON, PRMSE) and in subjective MOS evaluations, which highlights the practical value of the approach for real-time video conferencing and high-resolution reconstruction.
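The volume rendering integral above is usually evaluated numerically. The sketch below uses the standard NeRF quadrature (piecewise-constant density per sample interval); it is an illustration and not necessarily the authors' implementation. The function name `render_ray`, the bin-edge layout of `t`, and the toy inputs are assumptions made for this example; a real system would obtain `sigma` and `color` by querying the conditioned field $\mathcal{N}_\Theta$.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Approximate C(r) = integral of sigma * c * T dt along one ray.

    sigma: (N,) non-negative densities, one per sample interval
    color: (N, 3) RGB values, one per sample interval
    t:     (N+1,) increasing bin edges; interval i spans [t[i], t[i+1]]
    (This layout is an assumption of the sketch, not taken from the paper.)
    """
    sigma = np.asarray(sigma, dtype=np.float64)
    color = np.asarray(color, dtype=np.float64)
    delta = np.diff(np.asarray(t, dtype=np.float64))  # interval lengths
    optical_depth = sigma * delta
    # Opacity contributed by each interval under piecewise-constant density.
    alpha = 1.0 - np.exp(-optical_depth)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): exclusive cumulative sum.
    accumulated = np.concatenate([[0.0], np.cumsum(optical_depth)[:-1]])
    transmittance = np.exp(-accumulated)
    weights = transmittance * alpha
    rgb = (weights[:, None] * color).sum(axis=0)
    return rgb, weights

# Toy usage: four intervals of constant density and constant color.
edges = np.linspace(0.0, 4.0, 5)
rgb, weights = render_ray(np.ones(4), np.tile([0.2, 0.4, 0.6], (4, 1)), edges)
```

Because the weights telescope, their sum equals one minus the transmittance at the far bound, so a ray through empty space renders black and a ray through dense material returns (almost) the color of the first samples it hits.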
Abstract
Video conferencing has attracted considerable attention recently. High fidelity and low bandwidth are the two major objectives of video compression for video conferencing applications. Most pioneering methods rely on classic video compression codecs without high-level feature embedding and thus cannot reach extremely low bandwidth. Recent works instead employ model-based neural compression to achieve ultra-low bitrates using sparse representations of each frame, such as facial landmark information, but these approaches cannot maintain high fidelity because they rely on 2D image-based warping. In this paper, we propose a novel low-bandwidth neural compression approach for high-fidelity portrait video conferencing that uses implicit radiance fields to achieve both objectives. We leverage dynamic neural radiance fields to reconstruct a high-fidelity talking head from expression features, which are transmitted in place of the frames themselves (frame substitution). The overall system employs a deep model to encode expression features at the sender and reconstructs the portrait at the receiver, with volume rendering as the decoder, to achieve ultra-low bandwidth. In particular, because the model is based on neural radiance fields, our compression approach is resolution-agnostic: the low bandwidth it achieves is independent of video resolution, while fidelity is maintained for higher-resolution reconstruction. Experimental results demonstrate that our framework can (1) support ultra-low-bandwidth video conferencing, (2) maintain high-fidelity portraits, and (3) outperform previous works on high-resolution video compression.
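To make the resolution-agnostic claim concrete, the following back-of-the-envelope sketch compares the raw size of the per-frame feature payload with the raw size of the pixels it replaces. The feature dimensions (79 and 12) come from the TL;DR; everything else is an assumption for illustration: 32-bit values, 25 frames per second, 24 bits per pixel, and the helper names. The result is a rough bound before any refinement or entropy coding, not a bitrate reported by the paper.

```python
EXPRESSION_DIM = 79  # delta in R^79, from the text
POSE_DIM = 12        # p in R^12, from the text

def feature_kbps(fps, bits_per_value=32):
    """Raw feature bitrate in kbit/s (assumed precision, no entropy coding)."""
    values_per_frame = EXPRESSION_DIM + POSE_DIM
    return values_per_frame * bits_per_value * fps / 1000.0

def raw_pixel_kbps(width, height, fps, bits_per_pixel=24):
    """Uncompressed pixel bitrate in kbit/s, shown only for scale."""
    return width * height * bits_per_pixel * fps / 1000.0

FPS = 25  # assumed frame rate, not stated in the text
for side in (256, 512, 1024):
    print(
        f"{side}x{side}: features {feature_kbps(FPS):.1f} kbit/s, "
        f"raw pixels {raw_pixel_kbps(side, side, FPS):.1f} kbit/s"
    )
```

The feature bitrate depends only on the feature dimensions, the numeric precision, and the frame rate, so it stays constant as the output resolution grows, whereas the pixel bitrate scales with the number of pixels.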
