
High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom, Qi Dai, Chong Luo, Chang D. Yoo

Abstract

Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by discrete image tokenization. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLMs to such representations requires large-scale data and training costs comparable to the original pre-training. To circumvent this limitation, we propose a diffusion-based decoding framework that enhances image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs, thereby leaving the original model intact. At its core, Logit-to-Code Distributional Mapping converts the VLM's image-token logits into continuous, distribution-weighted code vectors with uncertainty features, providing an effective conditioning signal for diffusion decoding. A lightweight Logit Calibration aligns training-time proxy logits from the VQ-VAE encoder with VLM-generated logits, mitigating the train-inference gap. Conditioned on these representations, the Distribution-Conditioned Diffusion Decoder generates high-fidelity images. Trained only briefly on ImageNet-1K, our method consistently improves visual fidelity for both VQ-VAE reconstructions and text-to-image generations from VLM-predicted tokens.
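
As a rough illustration of the Logit-to-Code Distributional Mapping described above, the sketch below turns the logits for one image-token position into a probability-weighted average of codebook vectors plus two uncertainty features. The function names, the toy codebook, and the choice of entropy and maximum probability as uncertainty features are assumptions made for illustration. The abstract does not specify which uncertainty features the method uses.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_to_code(logits, codebook):
    """Map one position's image-token logits to a continuous code vector.

    logits:   list of K floats, one per codebook entry
    codebook: list of K vectors (lists of floats), all the same length
    Returns (code_vector, entropy, max_prob). Entropy and max_prob are
    assumed, illustrative uncertainty features.
    """
    probs = softmax(logits)
    dim = len(codebook[0])
    # Distribution-weighted code vector: expectation of codebook entries.
    code = [sum(p * entry[d] for p, entry in zip(probs, codebook))
            for d in range(dim)]
    # Shannon entropy in nats: high when the prediction is ambiguous.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_prob = max(probs)
    return code, entropy, max_prob

# Hypothetical two-entry codebook with 2-dimensional codes.
toy_codebook = [[1.0, 0.0], [0.0, 1.0]]
ambiguous = logit_to_code([0.0, 0.0], toy_codebook)
confident = logit_to_code([100.0, 0.0], toy_codebook)
```

With an ambiguous prediction the code vector lies between the two entries and the entropy is high. With a confident prediction the code vector approaches the single most likely entry, which is what hard token selection would return.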

Paper Structure (32 sections, 26 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Comparison of visual reconstruction quality from VQ-VAE–encoded representations. While the VQ-VAE decoded output exhibits shape distortion and loss of detail, our diffusion-based method consistently improves image fidelity, preserving structural accuracy and fine-grained visual sharpness.
  • Figure 2: Conceptual comparison of different strategies for improving visual fidelity in VLM image generation. (a) Tokenizer replacement retrains the entire VLM with a new continuous or hybrid tokenizer. (b) Diffusion-assisted decoding jointly optimizes the VLM and the diffusion model, altering the original VLM. (c) Our approach trains only the diffusion-based decoder, preserving the VLM's original capabilities.
  • Figure 3: Overview of our proposed framework. During inference, the pre-trained VLM produces Image-Token Logits, which are transformed in the Logit-to-Code Distributional Mapping stage into continuous Distribution-Weighted Code Vectors and complementary Uncertainty Features capturing the reliability and ambiguity of each token prediction. These continuous representations serve as conditioning inputs to the Distribution-Conditioned Diffusion Decoder, which refines localized visual structures to generate high-quality images. During training, instead of forwarding the VLM itself, the logits are approximated using the VQ-VAE Encoder from the VLM’s pre-training pipeline, followed by lightweight Logit Calibration to match the VLM's logit distribution.
  • Figure 4: Ablation results comparing the effect of discrete conditioning (Diff. Dec. (baseline)), continuous distribution-weighted conditioning (DCDD+LCDM), and calibrated continuous conditioning (DCDD+LCDM+LC) on ImageNet-1K image reconstruction.
  • Figure 5: Qualitative results on text-to-image generation. Comparison between images decoded from the VLM’s predicted tokens using its native VQ-VAE decoder and our diffusion-based decoder on prompts from the MJHQ-30K benchmark. Additional samples with corresponding prompts are provided in the Appendix.
  • ...and 6 more figures
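
The Figure 3 caption says that, during training, logits are approximated from the VQ-VAE encoder and then calibrated to match the VLM's logit distribution. The sketch below shows one way this could work, under two loudly stated assumptions that are not taken from the paper: proxy logits are the negative squared distances between an encoder feature and each codebook entry, and calibration is a single temperature scalar. The names `proxy_logits` and `calibrate` are hypothetical.

```python
def proxy_logits(feature, codebook):
    # Assumed proxy: closer codebook entries get larger logits.
    return [-sum((f - c) ** 2 for f, c in zip(feature, entry))
            for entry in codebook]

def calibrate(logits, temperature):
    # Assumed calibration: temperature scaling. A temperature above 1
    # flattens the distribution; below 1 sharpens it.
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]

# Hypothetical two-entry codebook and one encoder feature.
toy_codebook = [[1.0, 0.0], [0.0, 1.0]]
raw = proxy_logits([1.0, 0.0], toy_codebook)
scaled = calibrate(raw, temperature=2.0)
nearest = max(range(len(raw)), key=lambda k: raw[k])
```

Under these assumptions, the arg-max of the proxy logits is the nearest codebook entry, so the proxy agrees with ordinary VQ-VAE quantization. Temperature scaling leaves that arg-max unchanged and adjusts only how peaked the distribution is.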