UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

Rui Tian, Mingfei Gao, Mingze Xu, Jiaming Hu, Jiasen Lu, Zuxuan Wu, Yinfei Yang, Afshin Dehghan

TL;DR

This work introduces UniGen, a unified multimodal large language model capable of image understanding and generation, trained entirely on open-source data. It presents a data-centric training lifecycle spanning multi-stage pre-training, supervised fine-tuning, and direct preference optimization, backed by a novel Chain-of-Thought Verification (CoT-V) strategy for test-time scaling. CoT-V enables Best-of-N self-verification to improve generation quality without sacrificing understanding, and a CoT-V post-training step further enhances reasoning during inference. On GenEval and DPG-Bench, UniGen achieves state-of-the-art results, and comprehensive ablations show that it benefits from each training stage, offering actionable guidance for building unified MLLMs.

Abstract

We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions for future research.

Paper Structure

This paper contains 31 sections, 1 equation, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Comparison against state-of-the-art unified MLLMs. UniGen-1.5B outperforms Show-o-1.3B, Janus-1.3B and Janus-Pro-1.5B across understanding and generation benchmarks.
  • Figure 2: The architecture of UniGen, which is based on an autoregressive LLM and decoupled vision encoders for image understanding and generation tasks.
  • Figure 3: The workflow of UniGen using test-time scaling and CoT-V. Left: Illustration of Best-of-N selection with CoT-V. UniGen first generates 6 image candidates and then selects the two images with the highest score after self-verification using CoT-V. Right: Visualization of the step-by-step reasoning process in CoT-V for computing the final quality score.
  • Figure 4: An example of using different image verification methods: (a) Outcome Verification, (b) Rule-based Verification and (c) Chain-of-Thought Verification.
  • Figure 5: Visual examples of UniGen's results using CoT-V. The first three rows show examples for counting, position, and color attribute, respectively, and the last row shows images generated by free-form prompts. The first column contains images selected by UniGen as the test-time verifier.
  • ...and 4 more figures
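
The Best-of-N selection described in the Figure 3 caption (generate 6 candidates, keep the 2 with the highest self-verification score) can be illustrated with a minimal sketch. Everything named here is hypothetical: `generate` and `verify` are stand-ins for the model acting as image generator and as CoT-V verifier, and the aggregation in `cot_v_score` (fraction of step-by-step checks that pass) is an assumption made for illustration, not necessarily the scoring rule the paper uses.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def cot_v_score(verdicts: Sequence[bool]) -> float:
    """Aggregate per-step verdicts into one quality score.

    Assumption: the score is the fraction of checks that passed.
    An empty list of checks scores 0.0.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


def best_of_n(
    prompt: str,
    generate: Callable[[str], Image],                # hypothetical generator call
    verify: Callable[[str, Image], Sequence[bool]],  # hypothetical CoT-V verifier call
    n: int = 6,
    k: int = 2,
) -> List[Tuple[float, Image]]:
    """Generate n candidates and return the k highest-scoring (score, image) pairs."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(cot_v_score(verify(prompt, img)), img) for img in candidates]
    # sorted() is stable, so candidates with equal scores keep generation order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return ranked[:k]
```

In this sketch the same model would back both callables, which mirrors the abstract's description of UniGen acting as generator and verifier at test time; the defaults `n=6` and `k=2` are taken from the Figure 3 caption.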