VUGEN: Visual Understanding priors for GENeration

Xiangyi Chen, Théophane Vallaeys, Maha Elbayad, John Nguyen, Jakob Verbeek

TL;DR

VUGEN addresses the challenge of enabling robust image generation from Vision-Language Models (VLMs) by exploiting the VLM’s native visual understanding priors. It introduces a dimension reducer to compress the high-dimensional understanding embeddings into a tractable latent space $\tilde{\mathcal{Z}}$, while training the VLM to sample within this space; a pixel-space decoder then maps generated latents to images. The method employs flow matching with a velocity field predictor in a Mixture of Transformers to learn sampling in $\tilde{\mathcal{Z}}$, and compares two pixel decoders: a latent diffusion model (LDM) conditioned on $\tilde{z}$ and a lightweight pixel diffusion decoder (PDD). Across StockMix and ImageNet, VUGEN outperforms baselines in FID and alignment metrics and maintains the base VLM’s understanding capabilities, while reducing architectural complexity relative to prior two-stage approaches. This work demonstrates that generating directly in a VLM’s understanding latent space, paired with a simple pixel decoder, yields efficient, high-quality multimodal generation with strong prompt following and diversity.
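The flow-matching step mentioned above can be illustrated with a minimal NumPy sketch. It assumes the common linear interpolation path between Gaussian noise and a target latent; this section does not state the exact formulation the paper uses. The names `flow_matching_pair`, `fm_loss`, `euler_sample`, and `velocity_fn` are hypothetical. In VUGEN the velocity predictor would be the text-conditioned network inside the VLM, whereas here it is any callable that takes `(z, t)`.

```python
import numpy as np


def flow_matching_pair(z1, rng):
    """Build one batch of flow-matching training examples from target latents z1.

    Assumption: linear path z_t = (1 - t) * z0 + t * z1 with z0 ~ N(0, I),
    whose velocity along the path is the constant z1 - z0.
    """
    z0 = rng.standard_normal(z1.shape)        # noise endpoints
    t = rng.uniform(size=(z1.shape[0], 1))    # one time in [0, 1) per example
    zt = (1.0 - t) * z0 + t * z1              # point on the path at time t
    v_target = z1 - z0                        # regression target for the predictor
    return zt, t, v_target


def fm_loss(velocity_fn, z1, rng):
    """Mean squared error between predicted and target velocities."""
    zt, t, v_target = flow_matching_pair(z1, rng)
    return float(np.mean((velocity_fn(zt, t) - v_target) ** 2))


def euler_sample(velocity_fn, shape, steps, rng):
    """Draw latents by integrating dz/dt = velocity_fn(z, t) from t=0 to t=1."""
    z = rng.standard_normal(shape)            # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((shape[0], 1), i * dt)    # current time for every sample
        z = z + dt * velocity_fn(z, t)        # forward Euler update
    return z
```

As a sanity check, if every target latent equals the same point `mu`, the ideal velocity is `(mu - z) / (1 - t)`; plugging that in drives `fm_loss` to roughly zero and makes `euler_sample` land on `mu`. In the full pipeline, sampled latents would then be passed to the pixel decoder.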

Abstract

Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages the VLM's pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps these generated latents back to the image space. We find that a VAE-free pixel diffusion decoder is on par with or better than commonly used complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM's original understanding capabilities.

Paper Structure

This paper contains 15 sections, 1 equation, 10 figures, 3 tables.

Figures (10)

  • Figure 1: VUGEN inference (left): The complex VLM vision encoder space $\mathcal{Z}$ is reduced in dimension to $\tilde{\mathcal{Z}}$ for generative modeling. VUGEN samples in $\tilde{\mathcal{Z}}$, and the pixel decoder maps the generated latents to image space. VUGEN training (right): We first jointly train the dimension reducer and pixel decoder to ensure a latent space optimized for generation. Then the learned dimension reducer is frozen, and the VLM is trained to sample over the (fixed) reduced space $\tilde{\mathcal{Z}}$.
  • Figure 2: Original image (left) reconstructed from understanding latents $z$ (middle) and reduced latents $\tilde{z}$ (right). Accurate reconstruction indicates that both the full $z$ and the reduced $\tilde{z}$ understanding latents retain sufficient visual information.
  • Figure 3: Generation performance across training iterations on StockMix and ImageNet. VUGEN reaches better performance in fewer training steps than the compared baselines.
  • Figure 4: Qualitative comparison of images generated by models trained on StockMix. VUGEN demonstrates significantly stronger prompt-following capabilities (e.g., rainbow in column 1, umbrella and bench in column 2, fruit in column 3) and produces more realistic outputs as well as finer visual details (columns 4, 5 and 6).
  • Figure 5: Sample diversity analysis of models trained on StockMix. Images are generated with the prompts "A bear looking forward in a forest." and "A fresh looking salad on a square plate." While baselines tend to produce repetitive outputs (the same pose of the bear and uniform background for the salad), VUGEN exhibits variability in camera angles, backgrounds, and overall appearance.
  • ...and 5 more figures