VineetVC: Adaptive Video Conferencing Under Severe Bandwidth Constraints Using Audio-Driven Talking-Head Reconstruction

Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das, Sarbajit Pal

TL;DR

The paper tackles real-time video conferencing under severely constrained bandwidth by proposing VineetVC, an adaptive system that blends standard WebRTC transmission with an audio-driven talking-head reconstruction path. A telemetry-guided, hysteresis-based controller switches among Normal, Low-Bitrate, and AI modes, reassigning bitrate from pixel video to compact control and reference updates when needed. Key contributions include the three-mode bandwidth policy, a closed-loop capacity proxy driven by WebRTC statistics, and backend-agnostic talking-head synthesis, with extensive long-run logs demonstrating substantial bandwidth reduction and maintained conversational continuity. The work highlights practical benefits, privacy considerations, and deployment trade-offs, offering a path toward persistent conferencing in challenging networks and outlining future work to enhance robustness and multi-speaker scenarios.
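
The hysteresis-based switching among the Normal, Low-Bitrate, and AI modes can be illustrated with a minimal sketch. It is not the authors' controller: the class name, the threshold values, and the dwell count below are hypothetical, and the capacity estimate is assumed to arrive as a single kbps number per sampling interval. The idea shown is only that each boundary has a lower "step down" threshold and a higher "step up" threshold, and that a switch requires several consecutive samples past the threshold, so brief fluctuations do not cause mode flapping.

```python
from dataclasses import dataclass
from enum import IntEnum


class Mode(IntEnum):
    NORMAL = 0       # standard WebRTC audio/video
    LOW_BITRATE = 1  # rate-constrained audio/video
    AI = 2           # audio-driven talking-head reconstruction path


@dataclass
class HysteresisController:
    # All numbers here are illustrative assumptions, not values from the paper.
    down_kbps: tuple = (300.0, 80.0)  # NORMAL->LOW below 300, LOW->AI below 80
    up_kbps: tuple = (450.0, 150.0)   # LOW->NORMAL above 450, AI->LOW above 150
    dwell: int = 3                    # consecutive samples needed before a switch
    mode: Mode = Mode.NORMAL
    _pending: int = 0                 # +1 = step down in quality, -1 = step up
    _streak: int = 0

    def update(self, capacity_kbps: float) -> Mode:
        """Feed one capacity sample and return the (possibly new) mode."""
        direction = 0
        if self.mode < Mode.AI and capacity_kbps < self.down_kbps[self.mode]:
            direction = 1
        elif self.mode > Mode.NORMAL and capacity_kbps > self.up_kbps[self.mode - 1]:
            direction = -1

        if direction == 0 or direction != self._pending:
            # Streak broken or direction changed: restart counting.
            self._pending = direction
            self._streak = 1 if direction else 0
        else:
            self._streak += 1

        if direction and self._streak >= self.dwell:
            self.mode = Mode(self.mode + direction)
            self._pending, self._streak = 0, 0
        return self.mode


controller = HysteresisController()
for sample in (250, 250, 250):   # sustained drop below 300 kbps
    current = controller.update(sample)
# current is now Mode.LOW_BITRATE; a later sample of 350 kbps would keep it
# there, because 350 lies inside the 300-450 kbps hysteresis band.
```

Separating the two thresholds at each boundary is what makes this hysteresis rather than a plain threshold rule; the dwell count is an additional, optional smoothing choice.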

Abstract

Severe bandwidth depletion in consumer and constrained networks can destabilize real-time video conferencing: encoder rate control saturates, packet loss rises, frame rates drop, and end-to-end latency increases significantly. This work describes an adaptive conferencing system that combines WebRTC media delivery with a supplementary audio-driven talking-head reconstruction pathway and telemetry-driven mode regulation. The system consists of a WebSocket signaling service, an optional SFU for multi-party transmission, a browser client capable of real-time WebRTC statistics extraction and CSV telemetry export, and an AI REST service that processes a reference face image and recorded audio to produce a synthesized MP4. The browser can replace its outbound camera track with the synthesized stream at a median bandwidth of 32.80 kbps. The solution also includes a bandwidth-mode switching strategy and a client-side mode-state logger.
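
To make the telemetry side concrete, here is a small sketch of how a bitrate series and its median could be derived from periodic statistics snapshots and exported as CSV. It is written in Python for brevity, whereas the system described above does this in the browser. The snapshot fields `bytesSent` (cumulative bytes) and `timestamp` (milliseconds) mirror the WebRTC outbound RTP statistics; the function names, CSV column names, and sample values are made up for illustration and are not measurements from the paper.

```python
import csv
import io
import statistics


def bitrate_kbps(prev: dict, curr: dict) -> float:
    """Bitrate between two snapshots of a cumulative byte counter."""
    dt_ms = curr["timestamp"] - prev["timestamp"]
    if dt_ms <= 0:
        return 0.0
    # Bits per millisecond is numerically equal to kilobits per second.
    return (curr["bytesSent"] - prev["bytesSent"]) * 8 / dt_ms


def telemetry_rows(snapshots: list) -> list:
    """Turn consecutive snapshot pairs into one telemetry row each."""
    rows = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        rows.append({
            "timestamp_ms": curr["timestamp"],
            "kbps": round(bitrate_kbps(prev, curr), 2),
        })
    return rows


def to_csv(rows: list) -> str:
    """Serialize telemetry rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["timestamp_ms", "kbps"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical snapshots taken once per second (not data from the paper).
snapshots = [
    {"timestamp": 0, "bytesSent": 0},
    {"timestamp": 1000, "bytesSent": 4000},
    {"timestamp": 2000, "bytesSent": 8500},
    {"timestamp": 3000, "bytesSent": 12500},
]
rows = telemetry_rows(snapshots)
median_kbps = statistics.median([row["kbps"] for row in rows])
csv_text = to_csv(rows)
```

Working from counter deltas rather than instantaneous readings keeps the estimate robust to irregular sampling intervals, since each row is normalized by its own elapsed time.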

Paper Structure (8 sections, 28 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: System overview of the proposed bandwidth-adaptive video conferencing framework. After browser-based signaling and WebRTC session establishment, the sender periodically evaluates bandwidth/QoS and applies a three-mode policy: BR1 (standard A/V), BR2 (rate-constrained A/V), and BR3 (audio-only uplink with server-side talking-head synthesis and track replacement). Delivery is via SRTP P2P or an optional SFU.
  • Figure 2: Comparative analysis of throughput (kbps) for Standard WebRTC vs. AI Synthesis. Note the significant bandwidth saving achieved during the AI Synthesis phase compared to the Standard WebRTC baseline.