T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates

Zhitao Wang, Hengyu Man, Wenrui Li, Xingtao Wang, Xiaopeng Fan, Debin Zhao

TL;DR

T-GVC introduces a trajectory-guided generative video coding framework designed for ultra-low bitrate scenarios by combining a semantic-aware sparse motion sampling pipeline with training-free latent-space diffusion guidance. The encoder extracts dense motion trajectories, clusters them into motion instances, and encodes a compact subset of semantically important trajectories to preserve temporal semantics at low bitrate; the decoder uses a diffusion model guided by these trajectories in latent space, enabling physically plausible motion without retraining. Experimental results show T-GVC outperforms traditional codecs and prior neural methods in perceptual and semantic quality at ULB across multiple datasets, with ablations highlighting the advantages of trajectory guidance over text-based conditioning and the effectiveness of sparse motion sampling. The approach demonstrates precise motion control and flexible generation lengths, suggesting a practical pathway for efficient, semantically aware generative video coding under bandwidth constraints.

Abstract

Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding for Ultra-Low Bitrate (ULB) scenarios by leveraging powerful generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or excessive dependence on high-level text guidance, which tend to inadequately capture fine-grained motion details, leading to unrealistic or incoherent reconstructions. To address these challenges, we propose Trajectory-Guided Generative Video Coding (dubbed T-GVC), a novel framework that bridges low-level motion tracking with high-level semantic understanding. T-GVC features a semantic-aware sparse motion sampling pipeline that extracts pixel-wise motion as sparse trajectory points based on their semantic importance, significantly reducing the bitrate while preserving critical temporal semantic information. In addition, by integrating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free guidance mechanism in latent space to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that T-GVC outperforms both traditional and neural video codecs under ULB conditions. Furthermore, additional experiments confirm that our framework achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.
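The training-free guidance idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the loss form (squared feature difference between a trajectory's first point and its later points), the plain gradient-descent update, and every name here (`trajectory_loss`, `guidance_step`, `make_latents`) are assumptions chosen for illustration. A real system would apply such a correction to the latents of a video diffusion model at each denoising step; here the latents are small nested lists.

```python
from typing import List, Tuple

Latent = List[List[List[float]]]  # one frame: [height][width][channels]
Point = Tuple[int, int, int]      # (frame index, row, column) at latent resolution


def make_latents(frames: int, height: int, width: int, channels: int) -> List[Latent]:
    """Deterministic toy latents (stand-in for diffusion latents)."""
    return [
        [[[float(t + y + x + c) for c in range(channels)]
          for x in range(width)]
         for y in range(height)]
        for t in range(frames)
    ]


def trajectory_loss(latents: List[Latent], trajectory: List[Point]) -> float:
    """Assumed loss: sum of squared feature differences between the first
    point of a trajectory and every later point on the same trajectory."""
    t0, y0, x0 = trajectory[0]
    anchor = latents[t0][y0][x0]
    loss = 0.0
    for t, y, x in trajectory[1:]:
        feat = latents[t][y][x]
        loss += sum((f - a) ** 2 for f, a in zip(feat, anchor))
    return loss


def guidance_step(latents: List[Latent], trajectory: List[Point],
                  step_size: float = 0.1) -> None:
    """One in-place gradient-descent step on the loss above.

    The first trajectory point is treated as fixed (e.g. it lies on a
    decoded keyframe), so only the later points are updated.
    """
    t0, y0, x0 = trajectory[0]
    anchor = latents[t0][y0][x0]
    for t, y, x in trajectory[1:]:
        feat = latents[t][y][x]
        # Gradient of (feat - anchor)^2 with respect to feat is 2 * (feat - anchor).
        latents[t][y][x] = [f - step_size * 2.0 * (f - a)
                            for f, a in zip(feat, anchor)]


# Toy usage: one trajectory crossing three frames of 2x2 latents.
latents = make_latents(frames=3, height=2, width=2, channels=2)
trajectory = [(0, 0, 0), (1, 0, 1), (2, 1, 1)]
loss_before = trajectory_loss(latents, trajectory)
guidance_step(latents, trajectory, step_size=0.1)
loss_after = trajectory_loss(latents, trajectory)
```

The point of the sketch is only that the guidance acts on latents at inference time and needs no change to the model weights, which is what "training-free" means in the abstract.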

Paper Structure

This paper contains 34 sections, 13 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Examples of motion representation for video (event, optical flow, trajectory and text).
  • Figure 2: Overview of our T-GVC framework. On the encoder side, each pair of keyframes and the corresponding inter-frames are fed into the proposed sparse motion sampler to extract motion trajectories. The keyframes and trajectories are then encoded into compact bitstreams. On the decoder side, each decoded keyframe pair is encoded into latent features by the VAE encoder. These latent features, combined with zero-initialized latent features, form a latent sequence that is concatenated with the initial latent noise as the input to the VDM. The decoded sparse motion trajectories act as guidance conditions during inference to correct the motion in latent space. Finally, the clean output is decoded by the VAE.
  • Figure 3: Illustration of the semantic correlation of latent features (upscaled to the same resolution as the original frame) along the same trajectory across two reconstructed frames. Given the red trajectory point in (b), we plot (d) according to the similarity between the feature at that point and the latent features of frame 2.
  • Figure 4: R-D performance comparison on the HEVC Class B, HEVC Class C, UVG, and MCL-JCV datasets.
  • Figure 5: Visual quality comparison: ground truth, DCVC-FM, VTM, and the proposed T-GVC (top to bottom). The reconstructed frames of our framework demonstrate higher perceptual quality at similar bitrates.
  • ...and 3 more figures
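
The similarity plot described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the caption only says "similarity", so the use of cosine similarity is an assumption, the latents are tiny hand-written grids rather than upscaled VAE features, and the helper names (`cosine`, `similarity_map`, `argmax_2d`) are hypothetical.

```python
import math
from typing import List, Tuple

Latent = List[List[List[float]]]  # one frame: [height][width][channels]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_map(query: List[float], frame: Latent) -> List[List[float]]:
    """Similarity between one query feature and every location of a frame."""
    return [[cosine(query, feat) for feat in row] for row in frame]


def argmax_2d(grid: List[List[float]]) -> Tuple[int, int]:
    """(row, column) of the largest value in a 2D grid."""
    best = (0, 0)
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if value > grid[best[0]][best[1]]:
                best = (y, x)
    return best


# Toy latents: the feature at (0, 1) in frame 1 reappears, scaled, at (1, 0) in frame 2.
frame1: Latent = [[[1.0, 0.0], [0.0, 1.0]],
                  [[1.0, 1.0], [-1.0, 0.0]]]
frame2: Latent = [[[1.0, 0.0], [1.0, 1.0]],
                  [[0.0, 2.0], [-1.0, 0.0]]]

query_feature = frame1[0][1]                  # feature at the trajectory point in frame 1
sim = similarity_map(query_feature, frame2)   # heat map over frame 2
peak = argmax_2d(sim)                         # most similar location in frame 2
```

In this toy case the peak of the map lands where the matching feature sits in the second frame, which is the kind of correlation along a trajectory that the caption describes.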