Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo

TL;DR

The paper identifies two core obstacles in using representation encoders for text-to-image generation: lack of compact regularization in high-dimensional feature spaces and weak pixel-level reconstruction. It introduces a semantic-pixel reconstruction framework that compresses semantic information into a 96-channel latent (S-VAE) and then enriches it with pixel details (PS-VAE), yielding superior reconstruction and enabling fast, high-fidelity generation. Through a Deep-Fusion architecture, PS-VAE delivers strong results in text-to-image generation and instruction-based editing, outperforming RAE and traditional VAE baselines while generalizing to SigLIP2. Extensive ablations show semantic regularization is essential to avoid off-manifold artifacts, and that a balanced combination of semantic structure and pixel fidelity yields the best overall performance.

Abstract

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
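To make the compactness claim concrete, the short sketch below works out the latent size implied by "96 channels with 16x16 spatial downsampling". The 256x256 input resolution and the helper name `latent_shape` are assumptions chosen for illustration; the abstract does not state a resolution.

```python
def latent_shape(height, width, channels=96, downsample=16):
    """Return (channels, latent_height, latent_width) for an image of the given size.

    `channels` and `downsample` follow the figures quoted in the abstract;
    the function itself is a hypothetical helper, not part of the paper's code.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (channels, height // downsample, width // downsample)


# Assumed input resolution, for illustration only.
height, width = 256, 256

c, h, w = latent_shape(height, width)
latent_values = c * h * w            # 96 * 16 * 16
pixel_values = 3 * height * width    # RGB input
print((c, h, w), latent_values, pixel_values / latent_values)
# prints: (96, 16, 16) 24576 8.0
```

Under these assumptions the latent holds one eighth as many values as the RGB input, while each of the 16x16 spatial positions carries a 96-dimensional vector.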

Paper Structure

This paper contains 18 sections, 2 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Reconstruction and generation performance across different generation spaces. Compared to vanilla VAE, RAE improves generation convergence speed but quickly saturates due to its unconstrained semantic space and weak reconstruction. To address this, we project RAE features into a compact 96-channel latent space with a semantic reconstruction objective, forming S-VAE, which mitigates off-manifold issues and improves generation performance. Finally, PS-VAE further augments the semantic latent space with pixel-level reconstruction, enriching structural and texture details and achieving superior performance in both reconstruction and generation.
  • Figure 2: Visualization comparison between RAE and VAE. (a) RAE shows a noticeable gap in reconstruction performance compared to VAE. Benefiting from its rich semantic representation, RAE demonstrates stronger prompt-following ability in image editing tasks that require understanding the input image (b). However, its poor reconstruction quality limits practical usability, as it fails to preserve fine-grained and consistent details from the input image. Counterintuitively, in text-to-image generation, RAE exhibits severe structural and texture artifacts and substantially lags behind VAE (c), with a performance gap far larger than that observed in reconstruction.
  • Figure 3: Off-manifold behavior varies significantly with feature dimensionality. We construct a 2D 'PS'-shaped distribution and embed it into an 8D ambient space, yielding two learning settings with intrinsic dimension 2 and ambient dimension 8. (a) The 8D setting produces substantially more off-manifold samples than the intrinsic 2D space. (b) We measure the mean nearest-neighbor distance of the top 5% tail samples and observe that samples generated in 8D deviate much farther from the data manifold, indicating stronger off-manifold drift.
  • Figure 4: Visual comparison of generated examples across progressively improved latent spaces (RAE $\rightarrow$ S-VAE $\rightarrow$ PS-VAE). Artifacts are gradually reduced, with step-by-step improvements in texture and structure.
  • Figure 5: Compact latent space construction for preserving semantic structure and fine-grained details. We first regularize the unconstrained representation-encoder feature space by freezing the encoder and training a semantic VAE using only $\mathcal{L}_s$ and $\mathcal{L}_\mathrm{kl}$; during this stage, the pixel decoder is trained on the detached semantic latent with the pixel reconstruction loss $\mathcal{L}_\mathrm{P}$. After semantic reconstruction converges, we unfreeze all components and allow the pixel decoder to backpropagate gradients into the encoder, ensuring that the representation encoder captures fine-grained details of the input image.
  • ...and 7 more figures
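
The tail metric described in the Figure 3 caption (mean nearest-neighbor distance over the top 5% of samples) can be sketched as below. This is a toy illustration under loud assumptions, not the paper's experiment: a unit circle stands in for the 'PS'-shaped distribution, zero-padding stands in for the embedding into 8D, and isotropic Gaussian noise stands in for generator error (the paper trains generative models instead). All function names are hypothetical.

```python
import math
import random


def nearest_neighbor_distance(point, reference):
    """Euclidean distance from `point` to its closest point in `reference`."""
    return min(math.dist(point, ref) for ref in reference)


def tail_mean_nn_distance(samples, reference, tail_fraction=0.05):
    """Mean nearest-neighbor distance over the farthest `tail_fraction` of samples."""
    distances = sorted(
        (nearest_neighbor_distance(s, reference) for s in samples), reverse=True
    )
    tail_size = max(1, math.ceil(tail_fraction * len(distances)))
    return sum(distances[:tail_size]) / tail_size


def embed(point_2d, ambient_dim=8):
    """Assumed embedding: zero-pad a 2D point into the ambient space."""
    return tuple(point_2d) + (0.0,) * (ambient_dim - len(point_2d))


def perturb(point, rng, sigma):
    """Stand-in for generator error: Gaussian noise added to every coordinate."""
    return tuple(x + rng.gauss(0.0, sigma) for x in point)


rng = random.Random(0)

# Toy 2D data on a unit circle (placeholder for the 'PS'-shaped distribution).
angles = [2 * math.pi * i / 200 for i in range(200)]
data_2d = [(math.cos(t), math.sin(t)) for t in angles]
data_8d = [embed(p) for p in data_2d]

# Same per-coordinate noise level in both settings.
samples_2d = [perturb(p, rng, sigma=0.1) for p in data_2d]
samples_8d = [perturb(p, rng, sigma=0.1) for p in data_8d]

tail_2d = tail_mean_nn_distance(samples_2d, data_2d)
tail_8d = tail_mean_nn_distance(samples_8d, data_8d)
print(f"tail mean NN distance, intrinsic 2D: {tail_2d:.3f}")
print(f"tail mean NN distance, ambient 8D:  {tail_8d:.3f}")
```

In this toy setup the 8D tail distance comes out larger simply because errors can accumulate along the six extra coordinates that the data never occupies. That is one plausible intuition for the caption's observation, not a reproduction of it.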