Reconstruction Alignment Improves Unified Multimodal Models

Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang

TL;DR

This work tackles the gap between understanding and generation in unified multimodal models by replacing sparse image captions with dense semantic supervision from a visual understanding encoder. The authors introduce Reconstruction Alignment (RecA), a post-training strategy that conditions a UMM on its own embeddings and trains it to reconstruct the input image using a self-supervised loss, achieving substantial gains in image generation and editing with modest compute. RecA demonstrates state-of-the-art results across multiple UMM architectures and benchmarks, and is robust across different training setups, encoder types, and resolutions. The findings advocate for RecA as a general, efficient alignment technique that can complement caption-based pretraining for better visual fidelity and controllable editing.

Abstract

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
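
The objective described in the abstract can be written compactly. The following is only a sketch under assumed notation: the symbols $E_{\text{und}}$ (visual understanding encoder), $f_\theta$ (the UMM), $t_{\text{template}}$ (template text embeddings), and $\mathcal{L}$ (the self-supervised reconstruction loss) are chosen here for illustration and may differ from the notation used in the paper.

```latex
% Sketch of the RecA objective; all symbol names are illustrative assumptions.
% I: input image, h_v: its visual understanding embedding.
\begin{align}
  h_v &= E_{\text{und}}(I), \\
  \mathcal{L}_{\text{RecA}}(\theta)
      &= \mathcal{L}\bigl( f_\theta\bigl([\, t_{\text{template}} ;\, h_v \,]\bigr),\; I \bigr).
\end{align}
```

Here $[\,\cdot\,;\,\cdot\,]$ denotes concatenation of the conditioning inputs, and no caption appears anywhere in the objective: the image supervises its own reconstruction.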

Paper Structure

This paper contains 35 sections, 14 equations, 28 figures, 11 tables.

Figures (28)

  • Figure 1: Post-training UMMs with reconstruction alignment (i.e., RecA) substantially improves image generation and editing. Left: performance comparison on GenEval and DPGBench, where a 1.5B-parameter model post-trained with RecA surpasses much larger models across multiple benchmarks (Table \ref{tab:main_table}: GenEval, DPGBench, and WISE); Middle: compared with GPT-4o, RecA follows generation instructions more faithfully, especially for color attributes and spatial positions; Right: for editing, RecA better preserves instance identity, overall layout, and object shapes of the original images, such as the girl's lips.
  • Figure 2: Dense supervision from visual embeddings. a) Typical image generation models are trained on image–caption pairs and/or sequences whose text is a sparse representation of visual information. An image is worth far more than a hundred words and contains rich details that text alone cannot capture. As shown in the left three examples, even lengthy captions (500 words) miss key aspects such as textures, styles, layouts, shapes, and attributes, leading to imperfect generations relative to the original image. b) By contrast, embeddings from visual understanding encoders, e.g., CLIP, preserve richer and more faithful semantics. Can these image–embedding pairs provide the dense supervision needed to enhance image generation and editing? Surprisingly, the answer is yes: we find that image–embedding pairs can improve T2I and image editing in a zero-shot manner.
  • Figure 3: UMMs can often correctly recognize an uncommon concept (yellow broccoli) but fail to generate it, revealing misalignment between understanding and generation.
  • Figure 4: Overview of the semantic reconstruction alignment (RecA) pipeline. A visual understanding encoder (e.g., CLIP or SigLIP) extracts semantic features from the input image, which are fused with template text embeddings and passed to the Unified Multimodal Model (UMM) to regenerate the image. The UMM is optimized with a self-supervised loss (diffusion or cross-entropy) between the original and generated images or image latents. We freeze the understanding encoder except in cases where the UMM employs a shared encoder for both understanding and generation (e.g., Harmon). At inference time, RecA requires no additional inputs, operating as a standard UMM.
  • Figure 5: Post-training with RecA restores visual details missed by the baseline models. For each query image (left), we feed its visual understanding embeddings back into the UMM with the instruction "Describe the image in detail." The baseline model's visual responses (center), i.e., images, preserve the main subject but distort layout, textures, and colors, while RecA markedly restores visual details such as geometry, color, and overall fidelity.
  • ...and 23 more figures
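
The pipeline in the Figure 4 caption (frozen understanding encoder, template embedding fused with visual embeddings, generator trained to reconstruct the input) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the encoder and generator are plain linear maps, the loss is mean squared error rather than a diffusion or cross-entropy loss, and the names `reca_step`, `train`, `enc`, and `W_gen` are hypothetical and do not come from the paper or any released code.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_matrix(rows, cols, rng, scale):
    """Random matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def reca_step(W_gen, enc, template, batch, lr):
    """One full-batch gradient step of the toy reconstruction objective.

    Only W_gen (the stand-in generator) is updated; enc (the stand-in
    understanding encoder) is read but never modified, i.e. it is frozen.
    Returns the batch loss measured before the update.
    """
    n_out, n_in = len(W_gen), len(W_gen[0])
    grad = [[0.0] * n_in for _ in range(n_out)]
    loss = 0.0
    for image in batch:
        h_v = matvec(enc, image)       # dense visual embedding of the image
        cond = template + h_v          # fuse template and visual embeddings
        recon = matvec(W_gen, cond)    # "regenerate" the image
        loss += mse(recon, image) / len(batch)
        for i in range(n_out):
            g = 2.0 * (recon[i] - image[i]) / (n_out * len(batch))
            for j in range(n_in):
                grad[i][j] += g * cond[j]
    for i in range(n_out):
        for j in range(n_in):
            W_gen[i][j] -= lr * grad[i][j]
    return loss


def train(steps=200, img_dim=8, emb_dim=4, tmpl_dim=2, lr=0.05, seed=0):
    """Run the toy loop and return the loss recorded at each step."""
    rng = random.Random(seed)
    enc = make_matrix(emb_dim, img_dim, rng, 0.5)               # frozen
    W_gen = make_matrix(img_dim, tmpl_dim + emb_dim, rng, 0.1)  # trainable
    template = [1.0] * tmpl_dim   # stand-in for template text embeddings
    batch = [[rng.uniform(-1.0, 1.0) for _ in range(img_dim)] for _ in range(4)]
    return [reca_step(W_gen, enc, template, batch, lr) for _ in range(steps)]


if __name__ == "__main__":
    history = train()
    print(f"loss at first step: {history[0]:.4f}")
    print(f"loss at last step:  {history[-1]:.4f}")
```

The sketch never uses a caption: the only supervision is the image itself, routed through its own embedding. In a real UMM the encoder would be a pretrained model such as CLIP or SigLIP and the generator would be the UMM's generation pathway.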