Reconstruction Alignment Improves Unified Multimodal Models

Ji Xie; Trevor Darrell; Luke Zettlemoyer; XuDong Wang

TL;DR

This work addresses the gap between understanding and generation in unified multimodal models (UMMs) by replacing sparse image captions with dense semantic supervision from a visual understanding encoder. The authors introduce Reconstruction Alignment (RecA), a post-training strategy that conditions a UMM on its own visual understanding embeddings and trains it to reconstruct the input image with a self-supervised loss. The method yields substantial gains in image generation and editing at modest compute cost (27 GPU-hours in the reported setting). RecA reaches state-of-the-art results across multiple UMM architectures and benchmarks, and is robust to different training setups, encoder types, and resolutions. The findings support RecA as a general, efficient alignment technique that can complement caption-based pretraining for better visual fidelity and more controllable editing.

Abstract

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models, establishing it as an efficient and general post-training alignment strategy for UMMs.
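
The reconstruction objective described in the abstract can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration and is not the authors' implementation: an "image" is a list of four floats, the frozen understanding encoder is a fixed linear map (`ENCODER`), the generator is a trainable linear map (`w`), and the loss is plain mean squared error. The only point is the training loop's shape: embed the input with a frozen encoder, condition the generator on that embedding, and update the generator to reconstruct the same input, with no captions involved.

```python
import random

D = 4  # toy "image" size (hypothetical; real images are far larger)

# Hypothetical frozen understanding encoder: a fixed linear map, never updated.
ENCODER = [
    [0.5, 0.1, 0.0, 0.0],
    [0.0, 0.5, 0.1, 0.0],
    [0.0, 0.0, 0.5, 0.1],
    [0.1, 0.0, 0.0, 0.5],
]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def reconstruction_loss(w, images):
    """Mean squared error between generator output and the original image."""
    total = 0.0
    for x in images:
        x_hat = matvec(w, matvec(ENCODER, x))  # generate from the embedding
        total += sum((p - t) ** 2 for p, t in zip(x_hat, x))
    return total / (len(images) * D)


def train(images, steps=200, lr=0.1):
    """Full-batch gradient descent on the generator weights only."""
    w = [[0.0] * D for _ in range(D)]  # trainable toy generator
    losses = [reconstruction_loss(w, images)]
    scale = 2.0 / (len(images) * D)
    for _ in range(steps):
        grad = [[0.0] * D for _ in range(D)]
        for x in images:
            h = matvec(ENCODER, x)  # embedding acts as the dense "prompt"
            x_hat = matvec(w, h)
            for i in range(D):
                err = x_hat[i] - x[i]
                for j in range(D):
                    grad[i][j] += scale * err * h[j]
        for i in range(D):
            for j in range(D):
                w[i][j] -= lr * grad[i][j]
        losses.append(reconstruction_loss(w, images))
    return w, losses


rng = random.Random(0)
images = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(32)]
w, losses = train(images)
```

In this toy setting `losses` records the reconstruction error before training and after each step, so comparing `losses[0]` with `losses[-1]` shows the generator learning to recover its input from the embedding alone. The actual method applies the same idea to a full UMM with its own visual understanding encoder and generation objective, which this sketch does not attempt to model.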
