RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, Lijun Zhang

Abstract

Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they suffer from two problems: frozen encoders limit reconstruction fidelity, which in turn degrades editing quality, and the resulting latents are overly high-dimensional, which makes diffusion modeling difficult. To address these limitations, we propose the Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled, stage-wise training strategy that sequentially optimizes the generative-tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.
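
As a rough illustration of the two regularizers named in the abstract, the sketch below computes a pivot term and a KL term on plain Python lists. The abstract does not give either formula, so both are assumptions: the pivot term is written as the mean cosine distance between tuned-encoder features and frozen-pivot features, and the variational bridge is assumed to output a diagonal Gaussian that is pulled toward a standard normal. The function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pivot_regularizer(tuned_feats, pivot_feats):
    """ASSUMED form of the pivot term: mean cosine distance between the
    fine-tuned encoder's token features and the frozen pivot's features."""
    pairs = list(zip(tuned_feats, pivot_feats))
    return sum(cosine_distance(t, p) for t, p in pairs) / len(pairs)

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)) for one latent,
    ASSUMING the variational bridge outputs a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))
```

Under these assumptions, the pivot term is zero when the tuned features point in the same direction as the pivot features, and the KL term is zero when the bridge outputs exactly a standard normal.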

Paper Structure

This paper contains 36 sections, 5 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Motivation. A practical tokenizer for diffusion must simultaneously achieve high reconstruction fidelity for editing and strong generative tractability for diffusion training, while preserving the semantic structure of pretrained representation models.
  • Figure 2: Overview of RPiAE. A pretrained RM encoder extracts representation features, which are compressed by a variational bridge into diffusion-friendly latents and decoded back for pixel-space reconstruction; a frozen pivot replica provides semantic supervision during training.
  • Figure 3: Three-stage training of RPiAE: (I) pivot-regularized encoder tuning, (II) variational bridge training with KL regularization, and (III) decoder specialization under fixed latents.
  • Figure 4: Performance on GenEval, DPG-Bench, and GEdit over the course of training for different encoders. Our method achieves both a higher performance ceiling and faster convergence.
  • Figure 5: Visualizations of (a) text-to-image generation and (b) image editing.
  • ...and 4 more figures
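
The three-stage schedule in the Figure 3 caption can be read as a freeze/unfreeze plan. The sketch below is one possible reading, not the authors' configuration: the module names are hypothetical, and each stage marks as trainable only the module that its caption name mentions. Whether other modules (for example, a decoder during stage I) are also updated is not stated in the captions and is left out here. The pivot replica stays frozen throughout, as Figure 2 describes.

```python
# Hypothetical module names; only the stage descriptions come from Figure 3.
MODULES = ("encoder", "pivot", "bridge", "decoder")

# ASSUMPTION: each stage trains only the module its caption names.
STAGES = {
    "I_pivot_regularized_encoder_tuning": {"encoder"},
    "II_variational_bridge_training": {"bridge"},
    "III_decoder_specialization": {"decoder"},  # latents fixed: encoder and bridge frozen
}

def trainable_flags(stage):
    """Map each module name to True if it is updated in the given stage."""
    trainable = STAGES[stage]
    return {name: name in trainable for name in MODULES}

if __name__ == "__main__":
    for stage in STAGES:
        print(stage, trainable_flags(stage))
```

In a training loop, these flags would typically gate which parameters receive gradient updates in each stage.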