DREAM: Where Visual Understanding Meets Text-to-Image Generation

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra

TL;DR

DREAM is a unified framework that jointly optimizes discriminative and generative objectives, yielding a single multimodal model that performs well at both visual understanding and text-to-image generation.

Abstract

Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives while learning strong visual representations. DREAM is built on two key techniques. During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding, which aligns partially masked image candidates with the target text and selects the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (6.2% better than FLUID; lower FID is better), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.
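
The Masking Warmup idea in the abstract can be illustrated with a small sketch. The abstract only says that masking starts minimal and gradually moves to full masking; the linear ramp, the function names `warmup_mask_ratio` and `sample_mask`, and all default values below are assumptions made for illustration, not the schedule used in the paper.

```python
import random
from typing import List


def warmup_mask_ratio(step: int, warmup_steps: int,
                      start: float = 0.0, end: float = 1.0) -> float:
    """Return the mask ratio at a training step.

    Assumption: a linear ramp from `start` to `end` over `warmup_steps`.
    The paper's actual schedule may have a different shape.
    """
    if warmup_steps <= 0:
        return end
    progress = min(max(step / warmup_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_mask(num_tokens: int, ratio: float, rng: random.Random) -> List[bool]:
    """Randomly mark round(num_tokens * ratio) token positions as masked."""
    num_masked = round(num_tokens * ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 250, 500, 1000):
        ratio = warmup_mask_ratio(step, warmup_steps=1000)
        mask = sample_mask(16, ratio, rng)
        print(step, ratio, sum(mask))
```

Early steps leave most tokens visible, which matches the abstract's point that the contrastive objective needs mostly unmasked images at the start; later steps mask more tokens for the generative objective.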

Paper Structure

This paper contains 71 sections, 4 equations, 10 figures, and 23 tables.

Figures (10)

  • Figure 1: Performance of self-supervised and text-to-image generative models trained on CC12M. DREAM (yellow) forms the outer envelope across both discriminative and generative axes, outperforming all baselines and unifying strong visual understanding with high-quality text-to-image generation.
  • Figure 2: DREAM framework. Images are encoded into continuous tokens via Stable Diffusion VAE and randomly masked following a masking warmup schedule. The vision encoder is trained contrastively with text, and the decoder conditions on text to predict masked tokens with a diffusion-based reconstructive loss. Text conditioning is applied only in the decoder, ensuring the encoder learns visual representations without a text shortcut.
  • Figure 3: Semantically Aligned Decoding. The model spawns $K$ parallel candidates, each partially decoded to an intermediate timestep $t$. The encoder scores each candidate by comparing its visual embedding to the prompt embedding, and the top-scoring candidate is fully decoded—improving image fidelity and text alignment without external rerankers.
  • Figure 4: Performance of DREAM across different model sizes (B/L/H/G) for $\sigma{=}0.45$. Top: Linear probing on IN-1K. Bottom: FID on CC12M-50K with and without Semantically Aligned Decoding. DREAM consistently outperforms baselines on both metrics across model sizes.
  • Figure 5: Examples of images generated by DREAM with and without Semantically Aligned Decoding (SD). Without Semantically Aligned Decoding, the outputs exhibit less coherent structure and more low-level blur. Applying Semantically Aligned Decoding produces images with clearer details and improved consistency with the prompt, in line with the gains observed in FID and CLIP scores.
  • ...and 5 more figures
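
The selection procedure described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `init_candidate`, `partial_decode`, `embed_image`, and `finish_decode` are hypothetical stand-ins for model components, and cosine similarity is an assumed scoring function (the caption only says the visual embedding is compared to the prompt embedding).

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

State = TypeVar("State")


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantically_aligned_decode(
    prompt_embedding: Sequence[float],
    init_candidate: Callable[[int], State],          # hypothetical: start candidate k
    partial_decode: Callable[[State], State],        # hypothetical: decode up to step t
    embed_image: Callable[[State], Sequence[float]],  # hypothetical: encoder embedding
    finish_decode: Callable[[State], State],         # hypothetical: decode the rest
    num_candidates: int,
) -> Tuple[State, List[float]]:
    """Partially decode K candidates, score them, and finish only the best."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [partial_decode(init_candidate(k)) for k in range(num_candidates)]
    scores = [cosine_similarity(embed_image(c), prompt_embedding) for c in candidates]
    best = max(range(num_candidates), key=scores.__getitem__)
    return finish_decode(candidates[best]), scores


if __name__ == "__main__":
    # Toy stand-ins: each "candidate" is already a 2-D embedding vector.
    toys = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
    image, scores = semantically_aligned_decode(
        prompt_embedding=[1.0, 0.0],
        init_candidate=lambda k: toys[k],
        partial_decode=lambda s: s,
        embed_image=lambda s: s,
        finish_decode=lambda s: s,
        num_candidates=len(toys),
    )
    print(image, scores)
```

Only the selected candidate is decoded to completion, so in this sketch the extra work over single-sample decoding is the K partial decodes plus K scoring passes.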