Reparo: Loss-Resilient Generative Codec for Video Conferencing

Tianhong Li, Vibhaalakshmi Sivaraman, Pantea Karimi, Lijie Fan, Mohammad Alizadeh, Dina Katabi

TL;DR

Packet loss in real-time video conferencing degrades quality and causes freezes; retransmission is often impractical, and traditional FEC struggles with bursty losses. Reparo is a generative, loss-resilient codec that encodes frames as token indices using a shared codebook, plus a loss-recovery module that synthesizes missing tokens conditioned on received data and past context, enabling frame-wise independence and constant bitrate. The system comprises a neural codec (encoder/decoder), a deterministic packetizer, a bitrate controller with self-dropping, and a spatio-temporal ViT-based loss recovery module trained with simulated losses. Evaluated on a large, diverse 5-hour video conferencing corpus against state-of-the-art FEC baselines, Reparo achieves better visual quality (higher PSNR and SSIM, lower LPIPS) and dramatically fewer non-rendered frames while maintaining real-time operation, demonstrating the practicality of generative loss recovery for live video communication. This work highlights a path toward loss-resilient video codecs that leverage domain-specific generative models to improve quality without adding redundancy or latency from retransmissions.

Abstract

Packet loss during video conferencing often results in poor quality and video freezing. Retransmitting lost packets is often impractical due to the need for real-time playback, and using Forward Error Correction (FEC) for packet recovery is challenging due to the unpredictable and bursty nature of Internet losses. Excessive redundancy leads to inefficiency and wasted bandwidth, while insufficient redundancy results in undecodable frames, causing video freezes and quality degradation in subsequent frames. We introduce Reparo -- a loss-resilient video conferencing framework based on generative deep learning models to address these issues. Our approach generates missing information when a frame or part of a frame is lost. This generation is conditioned on the data received thus far, considering the model's understanding of how people and objects appear and interact within the visual realm. Experimental results, using publicly available video conferencing datasets, demonstrate that Reparo outperforms state-of-the-art FEC-based video conferencing solutions in terms of both video quality (measured through PSNR, SSIM, and LPIPS) and the occurrence of video freezes.

Paper Structure (20 sections, 1 equation, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Overview of Reparo. It comprises an encoder-decoder pair responsible for converting RGB frames into quantized tokens and vice versa, as well as new modules for packetization, bitrate control, and loss recovery that operate in the token space.
  • Figure 2: Token-based neural codec. The encoder converts patches from video frames into features and uses a codebook to quantize the features into tokens by finding the nearest neighbor of each feature in the codebook. The decoder then uses the tokens to reconstruct the video frame.
  • Figure 3: The transmitter first uses a deterministic packetizer to wrap image tokens into packets. Then a bitrate controller drops some tokens in each packet to adapt to the target bitrate. The receiver first decodes which tokens are dropped by the bitrate controller. It then depacketizes the received packets to extract the received token indices with the lost tokens identified.
  • Figure 4: Loss recovery module. It uses a neural architecture based on a spatio-temporal vision transformer to generate any lost tokens using the learned knowledge of how people and objects look, along with the received tokens in the current and recent frames.
  • Figure 5: We report the average and worst 10% PSNR, SSIM and LPIPS of baselines and Reparo under different loss levels. Higher is better for PSNR and SSIM; lower is better for LPIPS. We vary the target bitrate of Reparo and baselines to cover different achieved bitrates. Reparo's visual quality is significantly better than the baselines under all lossy conditions while achieving similar performance when there is no loss.
  • ...and 7 more figures
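
The captions for Figures 2 and 3 describe nearest-neighbor quantization against a codebook, deterministic packetization of tokens, and identification of lost tokens at the receiver. The following is a minimal sketch of those three steps, not the paper's implementation. The function names, the round-robin packet assignment, and the `mask_id` sentinel are all assumptions made for illustration; the captions do not specify how tokens are mapped to packets or how lost tokens are marked.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each feature vector, the index of its nearest codebook entry.

    features: (N, D) array of encoder outputs, one row per patch.
    codebook: (K, D) array shared by transmitter and receiver.
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def packetize(num_tokens: int, num_packets: int) -> np.ndarray:
    """Assign each token position to a packet.

    Round-robin is a hypothetical choice; any mapping works as long as both
    ends can compute it without extra signalling (i.e. it is deterministic).
    """
    return np.arange(num_tokens) % num_packets


def mark_lost(tokens: np.ndarray, assignment: np.ndarray,
              lost_packets: set, mask_id: int = -1) -> np.ndarray:
    """Receiver side: replace tokens carried by lost packets with mask_id.

    The masked positions are the ones a loss recovery module would fill in.
    """
    received = tokens.copy()
    received[np.isin(assignment, list(lost_packets))] = mask_id
    return received
```

Because the packet assignment depends only on the token count and the packet count, the receiver can recompute it locally and knows exactly which token positions are missing when a packet is lost. Received token indices can be turned back into features with a plain lookup, `codebook[tokens]`, before decoding.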