Generative Video Compression with One-Dimensional Latent Representation

Zihan Zheng; Zhaoyang Jia; Naifu Xue; Jiahao Li; Bin Li; Zongyu Guo; Xiaoyi Zhang; Zhenghao Chen; Houqiang Li; Yan Lu

Abstract

Recent generative video codecs (GVCs) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm leaves two key challenges in fully exploiting spatial-temporal redundancy. Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, in which adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common content across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes video data into extremely compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency: it achieves bitrate reductions of 60.4\% under LPIPS and 68.8\% under DISTS on the HEVC Class B dataset, surpassing previous video compression methods.

Project: https://gvc1d.github.io/
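To make the token-reduction idea concrete, the following is a minimal sketch, not the GVC1D architecture. It assumes that a small set of query vectors cross-attends to a larger set of patch embeddings, which is one common way to obtain a 1D token sequence without a fixed spatial layout. The function names (`softmax`, `cross_attend`), the sizes, and the random inputs are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, patches):
    """Each query pools information from all patch embeddings.

    Returns one output token per query plus the attention weights,
    so the output length is set by the number of queries, not patches.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    tokens, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * pi for qi, pi in zip(q, p)) for p in patches]
        w = softmax(scores)
        weights.append(w)
        tokens.append(
            [sum(w[j] * patches[j][d] for j in range(len(patches))) for d in range(dim)]
        )
    return tokens, weights


random.seed(0)
num_patches, num_tokens, dim = 64, 8, 16  # hypothetical sizes, not from the paper
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]

tokens, weights = cross_attend(queries, patches)
print(len(patches), "patch embeddings ->", len(tokens), "1D latent tokens")
```

Because each row of `weights` is a distribution over all patches, a token is free to concentrate on any region of the frame rather than on one fixed grid cell, which is the property the abstract attributes to 1D latent tokens.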

Paper Structure (18 sections, 5 equations, 13 figures, 4 tables)

Introduction
Related Work
Methodology
Framework Overview
Context Model
Analysis on 1D Latent Tokens
Experiment
Experimental Protocol
Experimental Results
Ablation Study
Complexity Analysis
Conclusion
Training Details
Experiments
Evaluation Details
...and 3 more sections

Figures (13)

Figure 1: Method comparison. (a) Previous generative video codecs [qi2025generative, ma2025diffusion, yang2022perceptual] encode videos into dense 2D latent grids with rigid spatial structures using short-term context $c_{s}$, resulting in numerous inflexible tokens. (b) Our method exploits short-term $c_{s}$ and 1D-based long-term context $c_{l}$ to encode videos into a few flexible 1D latent tokens. Attention map comparison. (c) 2D latent grids preserve fixed spatial correspondences between tokens and image patches, limiting redundancy exploitation and requiring complex memory designs [wang20243d, VTM, qian2024streaming]. (d) Our 1D latent tokens adaptively attend to semantic regions, while the 1D memory, managed by a few 1D tokens, efficiently preserves long-term context in a semantically coherent and computationally efficient manner.
Figure 2: Framework overview. Q, AE and AD represent quantization, arithmetic encoder and arithmetic decoder, respectively. The input image $x_t$ is first embedded into patches and then fed into the encoder, composed of local and global transformers, to produce $y_t$. The local transformer handles each window independently at low cost, while the global transformer captures inter-window correlations. An entropy model then performs autoregressive entropy coding on $y_t$. The decoder adopts an architecture similar to the encoder to reconstruct the image $\hat{x}_t$. All components are guided by a context model that combines a long-term 1D memory and a short-term context buffer to provide comprehensive temporal context.
Figure 3: 1D memory. $D_m$ denotes the number of Transformer layers. We employ simple yet effective Transformer layers to manage long-term context.
Figure 4: Visualization of 1D latent token outflows across two frames during object motion. In the two figures, the lines connect points corresponding to the maximum attention weights of each token, with the numbers indicating token indices (e.g., token 19 focuses on the horse’s left foreleg in both frames). Detailed attention maps show that each token consistently focuses on the same semantic region across frames, effectively capturing object motion.
Figure 5: Visualization of the outflow variation of a 1D latent token (index 4) as a new object appears. The red boxes in the first row mark the image patches with the highest attention weights, while the green lines in the second row link them to the four 1D latent tokens with the strongest attention. The bottom row shows the attention weights of the 1D latent tokens for the image patch with the maximum weight (red boxes). As new content emerges, attention weights gradually shift from previously active tokens to newly activated ones.
...and 8 more figures
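
The Figure 2 caption notes that the local transformer handles each window independently at low cost. As a rough illustration of why, the sketch below counts query-key pairs for windowed attention versus full attention over the same token sequence. The function names, the token count, and the window size are hypothetical and are not taken from the paper; the count also ignores everything except the attention matrix itself.

```python
def local_pairs(num_tokens, window):
    """Query-key pairs when tokens attend only within their own window."""
    full, rest = divmod(num_tokens, window)
    # Each full window contributes window * window pairs; a trailing
    # partial window (if any) contributes rest * rest pairs.
    return full * window * window + rest * rest


def global_pairs(num_tokens):
    """Query-key pairs when every token attends to every token."""
    return num_tokens * num_tokens


n, w = 1024, 64  # hypothetical sizes for illustration
print(local_pairs(n, w), global_pairs(n))  # 65536 1048576
```

With these example sizes, windowed attention evaluates 16 times fewer pairs than full attention, which is the kind of saving that motivates pairing an inexpensive local stage with a separate global stage for inter-window correlations.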
