Context Guided Transformer Entropy Modeling for Video Compression

Junlong Tong, Wei Zhang, Yaohui Jin, Xiaoyu Shen

TL;DR

The paper tackles the high computational cost of using temporal context and the lack of explicit spatial dependency ordering in conditional entropy models for video compression. It introduces the Context Guided Transformer (CGT) entropy model, consisting of a Temporal Context Resampler that uses learnable queries with window cross-attention and a Dependency-Weighted Spatial Context Assigner built on a teacher-student Swin Transformer with a random masking proxy task. Training aligns with inference through the teacher guiding the student via soft top-k token selection, while inference relies solely on the student to predict undecoded tokens. Empirical results show a ~65% reduction in entropy modeling time and an ~11% BD-Rate improvement over prior state-of-the-art methods, with strong generalization across datasets and frame codecs.
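The learnable-query idea behind the Temporal Context Resampler can be illustrated with a minimal sketch. It assumes a single attention head, no learned projections, and a single window; the sizes and the names `cross_attention_resample`, `latent_queries`, and `window` are hypothetical and not taken from the paper. The point it shows is only that a small set of queries attending over a larger set of temporal tokens yields a shorter sequence for downstream layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_resample(queries, context):
    """Compress `context` (N tokens) into len(queries) tokens.

    Each query attends over every context token in one window and returns
    a weighted average of them. Projections, multiple heads, and window
    partitioning are omitted to keep the sketch short.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    resampled = []
    for q in queries:
        # Scaled dot-product score between this query and every token.
        scores = [scale * sum(qi * ci for qi, ci in zip(q, tok)) for tok in context]
        weights = softmax(scores)
        # Weighted average of the context tokens (a convex combination).
        resampled.append(
            [sum(w * tok[d] for w, tok in zip(weights, context)) for d in range(dim)]
        )
    return resampled


random.seed(0)
DIM, NUM_TOKENS, NUM_QUERIES = 8, 16, 4  # hypothetical sizes
# Stand-in for the temporal latent tokens inside one window.
window = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
# Random here; in a trained model these queries would be learned parameters.
latent_queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

compact = cross_attention_resample(latent_queries, window)
print(len(window), "->", len(compact), "tokens per window")
```

Because the output length equals the number of queries rather than the number of input tokens, any later attention over the temporal context operates on the shorter sequence.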

Abstract

Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context often introduces additional model complexity and increases computational cost. In parallel, many existing spatial context models lack explicit modeling of the ordering of spatial dependencies, which may limit the availability of relevant context during decoding. To address these issues, we propose the Context Guided Transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal context and dependency-weighted spatial context. A temporal context resampler learns predefined latent queries to extract critical temporal information using transformer encoders, reducing downstream computational overhead. Meanwhile, a teacher-student network is designed as a dependency-weighted spatial context assigner to explicitly model the dependency of spatial context order. The teacher generates an attention map to represent token importance and an entropy map to reflect prediction certainty from randomly masked inputs, guiding the student to select the weighted top-k tokens with the highest spatial dependency. During inference, only the student is used to predict undecoded tokens based on high-dependency context. Experimental results demonstrate that our CGT model reduces entropy modeling time by approximately 65% and achieves an 11% BD-Rate reduction compared to the previous state-of-the-art conditional entropy model.
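The weighted top-k selection described in the abstract can be sketched as follows. The abstract does not give the weighting rule, so everything specific here is an assumption made for illustration: the score is a convex mix of max-normalized attention and a certainty term `1 - H / H_max`, tokens with higher certainty are favored, and the names `select_top_k` and `alpha` are hypothetical. The sketch also uses a hard top-k, whereas the TL;DR mentions a soft top-k during training.

```python
import math


def token_entropy(pmf):
    """Shannon entropy in bits of one token's predicted PMF."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def select_top_k(attention, pmfs, k, alpha=0.5):
    """Return the indices of the k tokens with the highest combined score.

    ASSUMPTION: score = alpha * normalized_attention + (1 - alpha) * certainty,
    where certainty = 1 - H / H_max. The paper's actual rule may differ.
    """
    max_att = max(attention)
    scores = []
    for att, pmf in zip(attention, pmfs):
        h_max = math.log2(len(pmf))  # entropy of a uniform PMF of this size
        certainty = 1.0 - token_entropy(pmf) / h_max
        scores.append(alpha * (att / max_att) + (1.0 - alpha) * certainty)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # report the chosen tokens in positional order


# Hypothetical teacher outputs for five tokens.
attention = [0.10, 0.40, 0.05, 0.30, 0.15]
pmfs = [
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],  # very certain
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

selected = select_top_k(attention, pmfs, k=2)
print("selected token indices:", selected)
```

In this toy input, token 2 receives little attention but is predicted with high certainty, so the mixed score ranks it above token 3, which has more attention but a flatter PMF. Changing `alpha` shifts that trade-off.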

Paper Structure

This paper contains 29 sections, 7 equations, 6 figures, 6 tables, 2 algorithms.

Figures (6)

  • Figure 1: Existing conditional entropy modeling methods incur high computational cost for temporal context modeling, and rely on predefined spatial orders instead of explicit modeling.
  • Figure 1: Detailed structure of the components in our CGT model. The TCR and decoder employ the same architecture for conciseness. In practice, the number of temporal context resamplers is set to 1, the Swin Transformer encoder has 2 blocks, and the Swin Transformer decoder has 4 blocks.
  • Figure 2: Overview of our video compression model. The model generates a latent representation $y_t$ of the current frame $x_t$ through a contextual frame codec. Given temporal and spatial contexts, the CGT model performs probabilistic modeling to provide the PMF for entropy coding. The temporal contexts such as $y_{t-1}$ are first resampled to capture key dependencies and significantly reduce model overhead. A transformer-based teacher network generates a masked representation $y_t^M$ by modeling the importance of spatial context, guiding the student network to efficiently utilize contextual information.
  • Figure 3: Comparison of window self-attention (top) and window cross-attention (bottom), where TCR utilizes window cross-attention to compress temporal latents through learnable queries.
  • Figure 4: Generalization capability of our CGT entropy model based on the DCVC-DC framework.
  • ...and 1 more figure
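The Figure 2 caption notes that the CGT model provides the PMF used for entropy coding. A small worked example may help show why a better PMF lowers the bitrate: an ideal entropy coder spends about -log2 p(s) bits on a symbol s. The two PMFs and the symbol sequence below are hypothetical and are not results from the paper.

```python
import math


def ideal_bits(pmf, symbol):
    """Ideal code length in bits for `symbol` under `pmf`."""
    return -math.log2(pmf[symbol])


def total_bits(pmf, symbols):
    """Sum of ideal code lengths over a symbol sequence."""
    return sum(ideal_bits(pmf, s) for s in symbols)


# Hypothetical quantized latent symbols to be coded.
symbols = [0, 0, 0, 1]

# A flatter PMF (little useful context) versus a sharper one
# (context that predicts the frequent symbol well).
flat_pmf = {0: 0.4, 1: 0.3, 2: 0.3}
sharp_pmf = {0: 0.8, 1: 0.1, 2: 0.1}

flat_cost = total_bits(flat_pmf, symbols)
sharp_cost = total_bits(sharp_pmf, symbols)
print(f"flat PMF:  {flat_cost:.2f} bits")
print(f"sharp PMF: {sharp_cost:.2f} bits")
```

The sharper PMF costs fewer bits here only because it concentrates probability on the symbols that actually occur; a confident but wrong PMF would cost more, which is why the quality of the context matters.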