Improving Reconstruction of Representation Autoencoder

Siyu Liu, Chujie Qin, Hubery Yin, Qixin Yan, Zheng-Peng Duan, Chen Li, Jing Lyu, Chun-Le Guo, Chongyi Li

TL;DR

The paper tackles the reconstruction bottleneck when using Vision Foundation Models as semantic encoders for latent diffusion models. It introduces LV-RAE, a representation autoencoder that keeps semantic features fixed while a lightweight encoder learns low-level details, producing a latent $z$ that aligns with semantic distributions yet preserves fine visual information. To address decoder sensitivity in high-dimensional latent spaces, the authors propose a two-stage robustness strategy: fine-tuning the decoder with latent noise and injecting controlled noise during diffusion sampling, which smooths off-manifold artifacts. Empirical results on ImageNet-1K demonstrate state-of-the-art reconstruction fidelity and competitive, often superior, generation quality, with notable gains in robustness and diffusion-friendly latents; the approach is also validated across multiple VFMs and backbones.

Abstract

Recent work leverages Vision Foundation Models as image encoders to boost the generative performance of latent diffusion models (LDMs), as their semantic feature distributions are easy to learn. However, such semantic features often lack low-level information (e.g., color and texture), leading to degraded reconstruction fidelity, which has emerged as a primary bottleneck in further scaling LDMs. To address this limitation, we propose LV-RAE, a representation autoencoder that augments semantic features with missing low-level information, enabling high-fidelity reconstruction while remaining highly aligned with the semantic distribution. We further observe that the resulting high-dimensional, information-rich latents make decoders sensitive to latent perturbations, causing severe artifacts when decoding generated latents and consequently degrading generation quality. Our analysis suggests that this sensitivity primarily stems from excessive decoder responses along directions off the data manifold. Building on these insights, we propose fine-tuning the decoder to increase its robustness and smoothing the generated latents via controlled noise injection, thereby enhancing generation quality. Experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving the semantic abstraction and achieving strong generative quality. Our code is available at https://github.com/modyu-liu/LVRAE.

Paper Structure (46 sections, 15 equations, 26 figures, 5 tables)

Figures (26)

  • Figure 1: Conceptual decomposition of the real data manifold. We hypothesize that the real data manifold can be decomposed into two distinct components: a smooth base manifold representing global semantics (captured by VFMs) and local variations representing low-level information (ignored by VFMs).
  • Figure 2: Overview of previous methods and LV-RAE. a) Training an autoencoder to align with VFMs fails to adequately preserve semantic consistency. b) Directly using a VFM as an autoencoder suffers from severely degraded reconstruction quality. c) The proposed LV-RAE significantly enhances reconstruction fidelity while effectively maintaining semantic representations. d) Fine-tuning the LV-RAE decoder improves its robustness to latent perturbations, making it suitable for generation.
  • Figure 3: The t-SNE visualization of DINOv3's semantic features with LV-RAE's latents in a shared representation space. Left: 20-class setting. Right: 2-class setting. LV-RAE latents exhibit strong overlap with DINOv3 semantic features, suggesting that they lie in a tightly shared representation space with minimal distributional discrepancy.
  • Figure 4: Toy Experiment. Underlying 2-dimensional data is embedded into a $D$-dimensional space to train a diffusion model. For visualization, the generated samples are projected back to two dimensions using a decoder that responds to both manifold and off-manifold directions. The parameter $\alpha$ controls the decoder’s sensitivity to off-manifold directions. In the high-dimensional setting ($D=128$), increasing the decoder’s sensitivity to off-manifold directions (larger $\alpha$) causes deviations along these directions to accumulate and be amplified by the decoder, resulting in severe departures from the ground truth.
  • Figure 5: Visualization of decoder sensitivity to latent perturbations. Top: In high-dimensional latent spaces, the decoder is sensitive to latent perturbations: even small ones (e.g., +0.1 noise) can produce pronounced pixel-level artifacts. Bottom: This sensitivity undermines generation because generative models often struggle to accurately capture the true data distribution, so even minor sampling shifts can cause significant structural distortions in the output. Our approach enhances generation quality by fine-tuning the decoder to increase robustness to latent perturbations and smoothing the generated latents via controlled noise injection.
  • ...and 21 more figures
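
The toy experiment described in the Figure 4 caption can be loosely illustrated with a small NumPy sketch. This is not the authors' setup: no diffusion model is trained here, and imperfect generation is simply stood in for by isotropic Gaussian noise of scale `sigma` added to the embedded latents. The linear decoder, the random off-manifold readout `W`, and the names `make_bases`, `decode`, and `toy_error` are all hypothetical and chosen only to show how a larger $\alpha$ and a larger $D$ could amplify off-manifold deviations.

```python
import numpy as np

def make_bases(D, rng):
    # Random orthonormal basis of R^D: the first 2 columns span the
    # "manifold", the remaining D-2 columns span off-manifold directions.
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
    return Q[:, :2], Q[:, 2:]

def decode(z, U, V, W, alpha):
    # Assumed linear decoder: the on-manifold component is read out exactly,
    # while the off-manifold component leaks into the output with weight alpha.
    return z @ U + alpha * (z @ V) @ W

def toy_error(D=128, alpha=1.0, sigma=0.1, n=2000, seed=0):
    rng = np.random.default_rng(seed)
    U, V = make_bases(D, rng)
    # Hypothetical readout mapping off-manifold coordinates to 2-D.
    W = rng.standard_normal((D - 2, 2)) / np.sqrt(2)
    x = rng.standard_normal((n, 2))                   # ground-truth 2-D data
    z = x @ U.T                                       # embed into R^D
    # Stand-in for imperfect generated latents (assumption, not a diffusion model).
    z_gen = z + sigma * rng.standard_normal((n, D))
    x_hat = decode(z_gen, U, V, W, alpha)
    # Mean squared distance between decoded samples and the ground truth.
    return float(np.mean(np.sum((x_hat - x) ** 2, axis=1)))

if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        print(f"D=128 alpha={alpha}: error={toy_error(D=128, alpha=alpha):.4f}")
```

Under these assumptions the same latent noise costs more as $\alpha$ grows, because each of the $D-2$ off-manifold coordinates contributes to the decoded output; with `alpha=0.0` only the two on-manifold coordinates of the noise survive decoding.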