Exploring Representation-Aligned Latent Space for Better Generation

Wanghan Xu, Xiaoyu Yue, Zidong Wang, Yao Teng, Wenlong Zhang, Xihui Liu, Luping Zhou, Wanli Ouyang, Lei Bai

TL;DR

This work addresses the semantic sparsity of the VAE latent spaces used by latent diffusion models. It introduces the Representation-Aligned Latent Space (ReaLS), which injects semantic priors by aligning VAE latents with DINOv2 features. An alignment network maps the latents to both patch-level and global semantic representations, and the KL and alignment losses are balanced during VAE training, which yields a more structured latent space. Diffusion models trained in this space improve FID by about 15% and support training-free downstream tasks such as segmentation and depth estimation. The approach is agnostic to the diffusion backbone and could be combined with feature-space alignment methods (e.g., REPA) for further gains.

Abstract

Generative models serve as powerful tools for modeling the real world, with mainstream diffusion models, particularly those based on the latent diffusion model paradigm, achieving remarkable progress across various tasks, such as image and video synthesis. Latent diffusion models are typically trained using Variational Autoencoders (VAEs), interacting with VAE latents rather than the real samples. While this generative paradigm speeds up training and inference, the quality of the generated outputs is limited by the quality of the latents. Traditional VAE latents are often seen as a spatial compression of pixel space and lack explicit semantic representations, which are essential for modeling the real world. In this paper, we introduce ReaLS (Representation-Aligned Latent Space), which integrates semantic priors to improve generation performance. Extensive experiments show that fundamental DiT and SiT models trained on ReaLS achieve a 15% improvement in the FID metric. Furthermore, the semantically enhanced latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
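The TL;DR describes a VAE objective that balances a KL term against patch-level and global alignment terms. The sketch below shows one way such an objective could be assembled in NumPy. It is not the authors' implementation: the cosine-distance alignment loss, the mean-pooled global token, the shared two-layer MLP, and the names `kl_to_standard_normal`, `cosine_distance`, `align_mlp`, and `vae_loss` are all assumptions made for illustration, and the loss weights are left as required arguments because the source does not state their values.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)), summed over latent dims, averaged over the batch."""
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return float(kl.reshape(kl.shape[0], -1).sum(axis=1).mean())

def cosine_distance(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity) taken along the feature axis."""
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

def align_mlp(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP standing in for the alignment network (assumed architecture)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def vae_loss(recon, mu, logvar, latents, mlp_params,
             patch_feats, global_feat, *, kl_weight, align_weight):
    """Reconstruction + weighted KL + weighted (patch + global) alignment."""
    # Patch-level alignment: one predicted feature per latent token, shape (B, N, D).
    patch_pred = align_mlp(latents, *mlp_params)
    # Global alignment: mean-pool the tokens first (an assumption), shape (B, D).
    global_pred = align_mlp(latents.mean(axis=1), *mlp_params)
    align = (cosine_distance(patch_pred, patch_feats)
             + cosine_distance(global_pred, global_feat))
    return recon + kl_weight * kl_to_standard_normal(mu, logvar) + align_weight * align
```

Here `patch_feats` and `global_feat` stand for precomputed DINOv2 patch and global features of the same image. Raising `kl_weight` pulls the latent distribution toward a standard normal, while raising `align_weight` pulls it toward the semantic features, which is the balance the TL;DR refers to.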

Paper Structure

This paper contains 24 sections, 8 equations, 5 figures, 11 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Representation-Aligned Latent Space (ReaLS) preserves more image semantics. a) t-SNE visualization of our latent space reveals a clear clustering, with samples from the same category closer to each other. b) Attention map of our latents shows a significant improvement in the semantic relevance among patches.
  • Figure 2: The training and inference pipeline of ReaLS. During VAE training, the latents of the VAE are aligned with the features of DINOv2 using an alignment network implemented via MLP. After the VAE training concludes, latent diffusion model training is performed in this latent space. In the inference phase, the latents generated by the diffusion model are converted into corresponding generated images through the VAE decoder. At the same time, the alignment network extracts semantic features, which are provided to the corresponding downstream task heads, enabling training-free tasks such as segmentation and depth estimation.
  • Figure 3: Samples generated on ImageNet 256×256 by SiT-XL/2 + ReaLS, with cfg=4.0.
  • Figure 4: Training-free Downstream Tasks on Latents. A diffusion model trained in the representation-aligned latent space naturally carries stronger semantics, which enables more downstream tasks on latents. The latents generated by the diffusion model are mapped to semantic features by the alignment network used during VAE training, and the corresponding task heads then produce outputs in multiple modalities. The first row shows segmentation results, and the second row shows depth estimation results.
  • Figure 5: Impact of KL Constraint on Latent Space and FID. As the KL weight increases, the FID first decreases and then rises again. The size of each point represents the standard deviation of the latent space.
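The inference path described in the captions of Figures 2 and 4 (generated latents, then the alignment network, then task heads) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: `run_task_heads` is a hypothetical name, the alignment network and both heads are random linear maps standing in for already-trained modules, and the VAE decoder branch that produces the image is omitted.

```python
import numpy as np

def run_task_heads(latents, align_fn, seg_head, depth_head, grid_hw):
    """Map generated latent tokens to per-patch segmentation and depth maps.

    No parameters are updated here: the alignment network and the task heads
    are assumed to be trained already, which is what makes the step training-free.
    """
    h, w = grid_hw
    feats = align_fn(latents)                                     # (B, N, D) semantic features
    seg_map = seg_head(feats).argmax(axis=-1).reshape(-1, h, w)   # class id per patch
    depth_map = depth_head(feats)[..., 0].reshape(-1, h, w)       # one depth value per patch
    return seg_map, depth_map

# Toy stand-ins with random weights (hypothetical sizes, for shape checking only).
rng = np.random.default_rng(0)
B, H, W, C, D, K = 2, 4, 4, 4, 8, 5
w_align = rng.normal(size=(C, D))
w_seg = rng.normal(size=(D, K))
w_depth = rng.normal(size=(D, 1))
latents = rng.normal(size=(B, H * W, C))   # stands in for diffusion-generated latents

seg_map, depth_map = run_task_heads(
    latents,
    lambda z: z @ w_align,
    lambda f: f @ w_seg,
    lambda f: f @ w_depth,
    (H, W),
)
```

Both outputs are at the resolution of the latent token grid. Producing full-resolution maps would need an upsampling step that this sketch leaves out.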