
Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow

Ziyue Zeng, Xun Su, Haoyuan Liu, Bingyu Lu, Yui Tatsumi, Hiroshi Watanabe

Abstract

Recent advances in generative modeling have enabled perceptual video compression at ultra-low bitrates, yet existing methods predominantly treat the generative model as a refinement or reconstruction module attached to a separately designed codec backbone. We propose \emph{Generative Video Codebook Codec} (GVCC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies: \emph{Image-to-Video} (I2V), with autoregressive GOP chaining, tail latent residual correction, and adaptive atom allocation; \emph{Text-to-Video} (T2V), operating at near-zero side information as a pure generative prior; and \emph{First-Last-Frame-to-Video} (FLF2V), with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVCC achieves high-quality reconstruction below 0.002\,bpp while supporting flexible bitrate control through a single hyperparameter.
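The ODE-to-SDE conversion mentioned above can be sketched with the standard marginal-preserving construction; the notation below ($v_\theta$, $g(t)$, the linear interpolation schedule) is illustrative and may differ from the paper's exact convention.

```latex
% Rectified-flow sampling follows the deterministic ODE
%   dx_t = v_theta(x_t, t) dt .
% By the Fokker--Planck equation, the following SDE family shares its
% marginals p_t for any free scale g(t), which provides the per-step
% stochastic injection points:
\[
  \mathrm{d}x_t
  = \Big[\, v_\theta(x_t, t)
        + \tfrac{g(t)^2}{2}\, \nabla_x \log p_t(x_t) \Big]\,\mathrm{d}t
  + g(t)\, \mathrm{d}W_t .
\]
% Under the linear schedule x_t = (1 - t) x_0 + t * eps with
% eps ~ N(0, I) (t = 0 data, t = 1 noise), the score is recoverable
% from the velocity alone, so no extra network is needed:
\[
  \nabla_x \log p_t(x_t)
  = -\,\frac{x_t + (1 - t)\, v_\theta(x_t, t)}{t} .
\]
```

The injected Gaussian noise at each step is the quantity a codebook can quantize, and the free scale $g(t)$ plausibly corresponds to the $g_{\mathrm{scale}}$ hyperparameter swept in Figure 4(d).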


Paper Structure

This paper contains 36 sections, 18 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: GVCC produces perceptually superior reconstruction at ultra-low bitrates. Left: diagonal split comparison between DCVC-RT (0.0017 bpp, LPIPS 0.394) and our GVCC-T2V (0.0016 bpp, LPIPS 0.117) on the UVG Jockey sequence---GVCC recovers sharp textures and coherent details while DCVC-RT exhibits severe oversmoothing. Middle: zoomed-in crops comparing DCVC-RT, GNVC-VD, and GVCC at comparable bitrates. Right: LPIPS comparison shows GVCC achieves a 70.3% reduction over DCVC-RT; a user study confirms GVCC-T2V is preferred over DCVC-RT in 97% and over GNVC-VD in 88% of pairwise comparisons.
  • Figure 2: Overview of the GVCC framework. Top: shared pipeline---a frozen 3D VAE encodes the GOP into latent space, where GVCC compresses it into codebook noise indices; the decoder replays the same trajectory to reconstruct the video. Bottom: three conditioning strategies. (a) T2V: codebook only, no reference frame. (b) I2V: autoregressive GOP chaining with tail residual correction. (c) FLF2V: dual-anchor boundary sharing across consecutive GOPs.
  • Figure 3: Temporal stability: consecutive-frame MAE across GOPs. (a) HoneyBee and (b) Beauty: T2V (blue) shows periodic upward spikes at GOP boundaries; FLF2V (green) produces the smoothest profile. (c) Jockey (high motion): I2V-AR (red) exhibits V-shaped downward spikes at GOP boundaries where tail residual correction forces the last frame close to GT, creating sharp contrast against the high baseline MAE (${\sim}$10--15). FLF2V maintains stable continuity across all content types.
  • Figure 4: Hyperparameter sweeps on UVG Beauty (T2V-1.3B, 720p). Blue: PSNR ($\uparrow$). Red: LPIPS ($\downarrow$). Purple: encoding time. Stars: selected defaults. (a) Atom count $M$: quality saturates around $M{=}64$ while BPP grows linearly. (b) Codebook size $K$: diminishing returns beyond 16384 at rapidly increasing cost. (c) Steps $T$: catastrophic at $T{=}5$, sharp improvement to $T{=}20$, marginal gains after. (d) Diffusion scale $g_{\mathrm{scale}}$: narrow sweet spot at 2.0--3.0; collapse at higher values. (e) GOP length: 17 frames insufficient; 33 and 49 comparable.
  • Figure 5: Rate--distortion curve of GVCC-T2V (1.3B, 480p, UVG average). Each point corresponds to a $(M, K)$ configuration from Table \ref{tab:rd_sweep}. The curve spans from 5.3 kbps to 322.8 kbps with monotonically increasing quality.
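As a back-of-the-envelope check on the sub-0.002 bpp claim, one can estimate the bitstream size under a simple accounting model: each of the $T$ denoising steps transmits $M$ codebook indices of $\log_2 K$ bits each. This model is our assumption for illustration (the paper's actual bitstream layout and any entropy coding may differ), and `gvcc_bpp_estimate` is a hypothetical helper, not the paper's code.

```python
import math

def gvcc_bpp_estimate(steps: int, atoms: int, codebook_size: int,
                      gop_frames: int, height: int, width: int) -> float:
    """Rough bits-per-pixel estimate, ASSUMING each denoising step
    transmits `atoms` indices of log2(codebook_size) bits each.
    Illustrative accounting only -- ignores side information and
    any entropy coding of the indices."""
    bits_per_gop = steps * atoms * math.log2(codebook_size)
    pixels_per_gop = gop_frames * height * width
    return bits_per_gop / pixels_per_gop

# Example with the defaults suggested by Figure 4:
# T = 20 steps, M = 64 atoms, K = 16384, a 49-frame GOP at 720p.
bpp = gvcc_bpp_estimate(steps=20, atoms=64, codebook_size=16384,
                        gop_frames=49, height=720, width=1280)
print(f"{bpp:.5f} bpp")
```

Under these assumptions the estimate lands around 4e-4 bpp, comfortably below the 0.002 bpp operating point quoted in the abstract, and it reproduces the trends in Figure 4: linear growth in $M$ and only logarithmic growth in $K$.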