Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance

Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Zihan Zheng, Yuan Zhang, Yan Lu

TL;DR

The paper addresses perceptually convincing video reconstruction at ultra-low bitrates by introducing S2VC, a single-step diffusion-based video codec built on a conditional coding framework. Contextual Semantic Guidance provides frame-adaptive, stable semantic conditioning derived from buffered features, and Temporal Consistency Guidance enforces cross-frame coherence via multi-scale diffusion blocks and cascade training. S2VC achieves state-of-the-art perceptual quality with an average bitrate saving of 52.73% measured by DISTS across benchmark datasets, along with strong results on motion-aware and realism-oriented metrics. The results support single-step diffusion as a practical route to high-quality video compression, while leaving room to extend the effective bitrate range.

Abstract

While traditional and neural video codecs (NVCs) have achieved remarkable rate-distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S2VC, a Single-Step diffusion based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S2VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.

Paper Structure

This paper contains 20 sections, 7 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Top: Our conditional video codec adopts a single-step diffusion model, which is especially critical for video, where multi-step diffusion would make sampling many frames prohibitively expensive. Bottom: Comparison of semantic guidance. Fixed text prompts cannot adapt to dynamic video content, while captions lack fine-grained details. Our contextual semantic guidance provides frame-wise detailed information without requiring additional caption or embedding models.
  • Figure 2: Example of decoded frames. S$^2$VC delivers the best perceptual quality at the lowest bitrate. In contrast, traditional codecs [bross2021overview, jvet2025ecm] show blocking artifacts, DCVC-FM [li2024neural] blurs details, and PLVC [yang2022perceptual] introduces artifacts.
  • Figure 3: Overview of the S$^2$VC framework. The feature buffer supports conditional coding, while the diffusion buffer enables feature propagation in TCG blocks for improved temporal consistency. LoRA [hu2022lora] is employed for efficient diffusion fine-tuning.
  • Figure 4: Semantic Distillation in S$^2$VC. The pretrained DINOv3 serves as a teacher, providing temporally stable and semantically rich features that are distilled into the semantic adapter.
  • Figure 5: Temporal Consistency Guidance (TCG) design and gradient flow. $l^{i}_{t}$ denotes the intermediate feature at the $i$-th scale of frame $t$. $f^{i}_{\text{in}}$ / $f^{i}_{\text{out}}$ are input and output of a TCG block. $L_{D}$ is the distortion loss, and $T$ is the total frame count.
  • ...and 16 more figures
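
The Figure 3 caption mentions that LoRA is used for efficient diffusion fine-tuning. As a rough illustration of the general LoRA idea only (not the S2VC training code), the sketch below keeps a pretrained weight matrix frozen and adds a trainable low-rank update scaled by `alpha / rank`. The class name `LoRALinear`, the helper functions, and the toy 4x4 weight are hypothetical and chosen purely for illustration; the rank and layers used in the paper are not stated in this summary.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


class LoRALinear:
    """Hypothetical toy layer: frozen W plus low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight            # frozen pretrained weight (out_dim x in_dim)
        self.rank = rank
        self.scaling = alpha / rank
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + (alpha / rank) * B @ A."""
        return add(self.weight, scale(matmul(self.B, self.A), self.scaling))

    def forward(self, x):
        """Apply the adapted weight to a vector x given as a list of floats."""
        column = [[v] for v in x]
        return [row[0] for row in matmul(self.effective_weight(), column)]

    def trainable_params(self):
        """Count entries of A and B, the only parameters that would be trained."""
        return self.rank * (len(self.weight) + len(self.weight[0]))


# Toy pretrained weight; the values are arbitrary.
W = [
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0, 0.0],
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 2.0],
]
layer = LoRALinear(W, rank=1, alpha=1.0)
```

With rank 1, this toy layer would train 8 values instead of the 16 in the full matrix, and because `B` starts at zero the adapted layer initially reproduces the frozen layer exactly. The saving grows with layer size, which is the usual motivation for applying LoRA to large pretrained diffusion models.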