High-Fidelity Causal Video Diffusion Models for Real-Time Ultra-Low-Bitrate Semantic Communication

Cem Eteke, Batuhan Tosun, Alexander Griessel, Wolfgang Kellerer, Eckehard Steinbach

TL;DR

Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates, outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.

Abstract

We introduce a video diffusion model for high-fidelity, causal, and real-time video generation under ultra-low-bitrate semantic communication constraints. Our approach uses lossy semantic video coding to transmit the semantic scene structure, complemented by a stream of highly compressed, low-resolution frames that provide sufficient texture information to preserve fidelity. Building on these inputs, we introduce a modular video diffusion model with three components: Semantic Control, a Restoration Adapter, and a Temporal Adapter. We further introduce an efficient temporal distillation procedure that extends the model to real-time, causal synthesis, reducing trainable parameters by 300x and training time by 2x while adhering to communication constraints. Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates (< 0.0003 bpp), outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.
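
To give a sense of scale for the 0.0003 bpp figure, the short sketch below converts bits per pixel into a channel rate. It assumes bpp is averaged over every pixel of every frame; the 512x512 resolution and 30 fps frame rate are hypothetical values chosen for illustration and are not taken from the paper.

```python
def bitrate_kbps(bpp: float, width: int, height: int, fps: float) -> float:
    """Convert bits per pixel (per frame) into kilobits per second."""
    bits_per_frame = bpp * width * height
    return bits_per_frame * fps / 1000.0


# Hypothetical video format (an assumption, not from the paper).
rate = bitrate_kbps(0.0003, width=512, height=512, fps=30)
print(f"{rate:.2f} kbps")
```

Under these assumed settings the budget is only a few kilobits per second for the whole stream, which is why the scene is sent as compact semantics plus heavily compressed low-resolution frames rather than as conventionally coded video.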

Paper Structure (47 sections, 24 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Overview of our framework: ultra-low-bitrate scene semantics guide generation via Semantic Control, compressed low-resolution frames provide appearance cues through the Restoration Adapter, an efficiently distilled Temporal Adapter enables causal synthesis, and caching accelerates generation.
  • Figure 2: Semantic video coding pipeline. Contours extracted from semantic object masks are simplified with a tolerance $\xi$, then differentially encoded, quantized to $Q$ symbols, and entropy-coded. For P-frames, selected via the I-frame period $P$, only semantic motion is transmitted.
  • Figure 3: The overall architecture of our video diffusion model, which extends a frozen backbone. The model takes as input the diffused latents $x_t^k$, degraded latents $\tilde{x}^k$, and semantics $Y^k$ of frame $k$. Semantic Control injects features extracted from $Y^k$ into the backbone. The Restoration Adapter uses degraded features $\tilde{z}^k$ as queries in a restoration attention module. The Temporal Adapter applies causal temporal attention between the current features $z_t^k$ and the cached features $z_t^{k-W:k-1}$ of the previous $W$ frames.
  • Figure 4: Efficient distillation of the Temporal Adapter $\phi^-$. Red-dotted lines denote the gradients.
  • Figure 5: Example videos from the YCB-Sim dataset.
  • ...and 6 more figures
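
The contour-coding steps named in the Figure 2 caption (simplify with tolerance $\xi$, differentially encode, quantize to $Q$ symbols, entropy-code) can be illustrated with a minimal sketch. Several choices here are assumptions rather than the paper's method: the caption does not name the simplification algorithm, so Ramer-Douglas-Peucker is used as a common stand-in; the quantizer is a plain uniform one with a hypothetical range `max_abs`; and the entropy coder is replaced by an empirical Shannon-entropy estimate of the coded size. P-frame motion coding is not modeled.

```python
import math
from collections import Counter


def simplify(points, xi):
    """Ramer-Douglas-Peucker simplification of an open polyline (assumed algorithm)."""
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    norm = math.hypot(x1 - x0, y1 - y0)

    def dist(p):
        # Perpendicular distance from p to the chord between the endpoints.
        if norm == 0:
            return math.hypot(p[0] - x0, p[1] - y0)
        return abs((x1 - x0) * (y0 - p[1]) - (x0 - p[0]) * (y1 - y0)) / norm

    idx, dmax = max(
        ((i, dist(p)) for i, p in enumerate(points[1:-1], start=1)),
        key=lambda t: t[1],
    )
    if dmax <= xi:
        return [points[0], points[-1]]
    left = simplify(points[: idx + 1], xi)
    right = simplify(points[idx:], xi)
    return left[:-1] + right  # drop the duplicated split vertex


def delta_encode(points):
    """Keep the first vertex; store every other vertex as an offset from its predecessor."""
    deltas = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(points, points[1:])]
    return points[0], deltas


def quantize(values, q, max_abs):
    """Uniformly map values in [-max_abs, max_abs] to integer symbols 0..q-1."""
    step = 2 * max_abs / (q - 1)
    return [min(q - 1, max(0, round((v + max_abs) / step))) for v in values]


def dequantize(symbols, q, max_abs):
    """Inverse of quantize, up to half a quantization step of error."""
    step = 2 * max_abs / (q - 1)
    return [s * step - max_abs for s in symbols]


def entropy_bits(symbols):
    """Empirical entropy times length: a size estimate for an ideal entropy coder."""
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in Counter(symbols).values())


# Toy contour (hypothetical data): the small bump at x=1 is removed by simplification.
contour = [(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 0)]
simplified = simplify(contour, xi=0.5)
first, deltas = delta_encode(simplified)
flat = [c for d in deltas for c in d]
symbols = quantize(flat, q=16, max_abs=8.0)
print(simplified, symbols, round(entropy_bits(symbols), 2))
```

In this sketch a larger `xi` removes more vertices and a smaller `q` shrinks the symbol alphabet, so both lower the estimated bit count at the cost of contour accuracy.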