UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

Rui Tian; Mingfei Gao; Mingze Xu; Jiaming Hu; Jiasen Lu; Zuxuan Wu; Yinfei Yang; Afshin Dehghan

TL;DR

This work introduces UniGen, a unified multimodal large language model capable of image understanding and generation, trained entirely on open-source data. It presents a data-centric training lifecycle spanning multi-stage pre-training, supervised fine-tuning, and direct preference optimization, backed by a novel Chain-of-Thought Verification (CoT-V) strategy for test-time scaling. CoT-V enables Best-of-N self-verification to improve generation quality without sacrificing understanding, and a CoT-V post-training step further enhances reasoning during inference. UniGen achieves state-of-the-art results on GenEval and DPG-Bench, and comprehensive ablations show that it benefits from each training stage, offering actionable guidance for building unified MLLMs.

Abstract

We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions for future research.
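
The Best-of-N selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `generate` and `verify` are hypothetical callables standing in for the same model acting as generator and as step-by-step verifier, and scoring a candidate by the fraction of checks it passes is an assumption, since the abstract does not say how the verification steps are aggregated.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Image = TypeVar("Image")  # placeholder for whatever the generator returns


def cot_score(checks: Sequence[bool]) -> float:
    """Fraction of step-by-step checks that passed (0.0 if no checks were run).

    Assumption: the aggregation rule is illustrative, not taken from the paper.
    """
    return sum(checks) / len(checks) if checks else 0.0


def best_of_n(
    prompt: str,
    generate: Callable[[str], Image],                # hypothetical: prompt -> candidate image
    verify: Callable[[str, Image], Sequence[bool]],  # hypothetical: per-step pass/fail verdicts
    n: int = 4,
) -> Tuple[Image, float]:
    """Sample n candidates, score each with the verifier, and keep the best one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [cot_score(verify(prompt, image)) for image in candidates]
    # On ties, max() keeps the earliest candidate.
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

In this sketch the generator and verifier are passed in separately only to keep the example self-contained; in the setting the abstract describes, both roles are played by the same unified model.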
