Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

Jiajun He, Zongyu Guo, Zhaoyang Jia, Xiaoyi Zhang, Jiahao Li, Xiao Li, Bin Li, José Miguel Hernández-Lobato, Yan Lu

TL;DR

This work introduces a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model, suggesting a unified framework bridging visual compression and generation.

Abstract

Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual signals, e.g., an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement of the compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.
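To make the idea of "low-rank adaptations hashed into a single compact vector" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's confirmed scheme: the index-and-sign mapping is a generic hashing-trick construction, and the names `hashed_lora_factors` and `adapted_linear`, the `seed` parameter, and the toy sizes are all hypothetical.

```python
import random


def hashed_lora_factors(vector, d_out, d_in, rank, seed=0):
    """Expand a short vector into LoRA factors B (d_out x rank) and A (rank x d_in).

    ASSUMPTION: each factor entry reads one slot of `vector`, chosen by a fixed
    pseudo-random index map and multiplied by a random sign. The seed stands in
    for a hash function shared by encoder and decoder.
    """
    rng = random.Random(seed)
    k = len(vector)

    def draw(rows, cols):
        return [[rng.choice((-1.0, 1.0)) * vector[rng.randrange(k)]
                 for _ in range(cols)]
                for _ in range(rows)]

    B = draw(d_out, rank)
    A = draw(rank, d_in)
    return B, A


def adapted_linear(W, B, A, x, scale=1.0):
    """Compute (W + scale * B A) x without materializing the full update B A."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]    # length: rank
    BAx = [sum(b * z for b, z in zip(row, Ax)) for row in B]    # length: d_out
    Wx = [sum(w * xi for w, xi in zip(row, x)) for row in W]    # frozen path
    return [wx + scale * u for wx, u in zip(Wx, BAx)]


# Toy usage: the frozen weight W is never modified; only `vector` is stored.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
vector = [0.1, -0.2, 0.3, 0.05]
B, A = hashed_lora_factors(vector, d_out=2, d_in=3, rank=1)
y = adapted_linear(W, B, A, [1.0, 2.0, 3.0])
```

In this sketch the decoder needs only `vector` and the shared seed to rebuild the adapter, so the stored representation grows with the vector length rather than with the size of the adapted layers.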

Paper Structure (61 sections, 1 theorem, 45 equations, 19 figures, 3 tables, 1 algorithm)

Key Result

Proposition 2.1

Assume the learned vector field $v(\cdot,t)$ is $L$-Lipschitz for any $t$. Let $e_t$ denote the approximation error of the learned vector field at time $t$ along the clean path, and let $\delta_t$ be the state error at time $t$. Then the final reconstruction error $\delta$ satisfies

Figures (19)

  • Figure 1: (a) Explicit representations that encode signals into symbolic latent variables. (b) Implicit representations that encode signal information implicitly in functions. (c) One-vector adaptation of large-scale generative models can serve as implicit visual representations.
  • Figure 2: Framework for visual compression with a single vector that adapts the diffusion foundation model.
  • Figure 3: Reconstruction quality vs. training step. (a) Common LoRA representations with different ranks for image Kodim03 from the Kodak dataset [Kodakdataset]. (b) One-vector representations for image Kodim03, varying LoRA rank and vector size after hashing. (c) One-vector representations for video Beauty from the UVG dataset [mercat2020uvg].
  • Figure 4: Comparisons with existing video codecs on UVG. For DISTS and FVD, lower is better. For PSNR, higher is better.
  • Figure 5: Visual comparisons with different methods. More results are in the appendix and supplementary materials.
  • ...and 14 more figures

Theorems & Definitions (2)

  • Proposition 2.1
  • proof