Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors

Yunuo Chen, Chuqin Zhou, Jiangchuan Li, Xiaoyue Ling, Bing He, Jincheng Dai, Li Song, Guo Lu

Abstract

We present a novel paradigm for ultra-low-bitrate image compression (ULB-IC) that exploits the "temporal" evolution in generative image compression. Specifically, we define an explicit intermediate state during decoding: a compact anchor frame, which preserves the scene geometry and semantic layout while discarding high-frequency details. We then reinterpret generative decoding as a virtual temporal transition from this anchor to the final reconstructed image. To model this progression, we leverage a pretrained video diffusion model (VDM) as a temporal prior: the anchor frame serves as the initial frame and the original image as the target frame, transforming the decoding process into a next-frame prediction task. In contrast to image diffusion-based ULB-IC models, our decoding proceeds from a visible, semantically faithful anchor, which improves both fidelity and realism for perceptual image compression. Extensive experiments demonstrate that our method achieves superior objective and subjective performance. On the CLIC2020 test set, our method achieves over 50% bitrate savings across LPIPS, DISTS, FID, and KID compared to DiffC, while also delivering a decoding speedup of up to 5×. Code will be released later.

Paper Structure (19 sections, 8 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: (a) Some previous diffusion-based ULB-IC methods transmit latents that serve as conditions for image diffusion models during generative decoding. (b) Our method decodes a compact anchor frame and uses a video diffusion model to evolve this anchor into the target image via next-frame prediction. The radar plot compares our model with recent generative codecs on the CLIC2020 test set.
  • Figure 2: An illustration of our next-frame prediction paradigm for generative codecs.
  • Figure 3: Defocus-to-Focus Temporal Transition Prior. These generated frames show that the VDM generates natural textures where appropriate (e.g., parrot feathers). NeFIC exploits this property and collapses multi-frame transitions into next-frame prediction.
  • Figure 4: Overview of Training Stage I. The input to the DiT blocks is a concatenation of three types of tokens: (1) text embeddings, (2) tokens from the anchor frame, and (3) noised tokens from the target frame. During training, only the output corresponding to the noised target tokens is supervised using a noise prediction loss, while the other two segments are discarded.
  • Figure 5: Overview of Training Stage II. (a) Compression and decompression. The Anchor Encoder is conditioned on Video-VAE latents of the original image and produces entropy-coded anchor latents. The Anchor Decoder reconstructs the anchor frame, and a bypass latent is generated from the decoding process. (b) Detailed one-step generation. A single-step video diffusion model consumes text embeddings, anchor tokens, and the bypass latent to predict the target latent, which the Video-VAE decoder converts to the final reconstruction.
  • ...and 4 more figures
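
The Stage I supervision described in the Figure 4 caption (three token segments concatenated, with only the noised-target segment contributing to the loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the token shapes, the additive noising with a fixed `sigma`, the mean-squared-error form of the noise prediction loss, and the stand-in `model` callable are all assumptions made for illustration.

```python
import numpy as np

def masked_noise_prediction_loss(text_tokens, anchor_tokens,
                                 noised_target_tokens, true_noise, model):
    """Supervise only the output slice aligned with the noised target tokens.

    `model` is a hypothetical stand-in for the DiT blocks: any callable that
    maps a (num_tokens, dim) array to an array of the same shape.
    """
    # Concatenate the three segments in the order listed in the caption:
    # text embeddings, anchor-frame tokens, noised target-frame tokens.
    sequence = np.concatenate(
        [text_tokens, anchor_tokens, noised_target_tokens], axis=0)
    output = model(sequence)

    # Keep the trailing slice (target segment); the outputs for the text and
    # anchor segments are discarded and never enter the loss.
    n_target = noised_target_tokens.shape[0]
    predicted_noise = output[-n_target:]

    # Assumed loss form: mean squared error against the injected noise.
    return float(np.mean((predicted_noise - true_noise) ** 2))

# Toy data with hypothetical sizes.
rng = np.random.default_rng(0)
dim = 8
text_tokens = rng.normal(size=(4, dim))
anchor_tokens = rng.normal(size=(16, dim))
target_tokens = rng.normal(size=(16, dim))
true_noise = rng.normal(size=(16, dim))

sigma = 0.5  # hypothetical noise level; the paper's schedule is not given here
noised_target_tokens = target_tokens + sigma * true_noise

# Identity function standing in for the real network.
loss = masked_noise_prediction_loss(
    text_tokens, anchor_tokens, noised_target_tokens, true_noise,
    model=lambda seq: seq)
print(f"masked loss with identity model: {loss:.4f}")
```

Because the loss reads only the trailing slice of the output, whatever the model emits at the text and anchor positions has no effect on the gradient signal, which matches the caption's statement that those two segments are discarded.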