Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Peng Sun, Jun Xie, Tao Lin

Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively on abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model on a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only about 1050 H800 GPU hours, with the vast majority (1000 hours) dedicated to the efficient image-only pre-training stage. It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 and 0.55) and BLIP3-o-4B (0.84 and 0.50). Code is available at https://github.com/LINs-lab/IOMM.
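
The two-stage data schedule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `sample_batch`, the `paired_fraction` parameter, and the representation of samples as `(image, caption)` tuples are all assumptions made for illustration, and the abstract does not state the actual mixture ratio.

```python
import random


def sample_batch(stage, image_only, paired, batch_size,
                 paired_fraction=0.5, rng=None):
    """Draw one batch of (image, caption) tuples for a given training stage.

    Stage 1: image-only samples exclusively (caption is None).
    Stage 2: a mixture of image-only samples and text-image pairs.
    `paired_fraction` is a hypothetical knob, not a value from the paper.
    """
    if rng is None:
        rng = random.Random()
    if stage == 1:
        # Pre-training: no paired data is touched at all.
        return [(rng.choice(image_only), None) for _ in range(batch_size)]
    if stage != 2:
        raise ValueError("stage must be 1 or 2")
    batch = []
    for _ in range(batch_size):
        if rng.random() < paired_fraction:
            batch.append(rng.choice(paired))  # (image, caption) pair
        else:
            batch.append((rng.choice(image_only), None))
    return batch
```

With `stage=1` every caption is `None`, which mirrors the claim that pre-training needs no paired data; with `stage=2` the fraction of captioned samples is controlled by the assumed `paired_fraction`.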

Paper Structure (41 sections, 7 figures, 10 tables, 1 algorithm)

Figures (7)

  • Figure 1: An overview and validation of our proposed training paradigm. (a) Visual results of our IOMM-XL, demonstrating high-quality, multi-resolution image synthesis. Corresponding prompts are provided in the appendix (prompt details). (b) An illustration of the six training recipes we investigate. (c) Quantitative results of the six training recipes on the GenEval benchmark.
  • Figure 2: Visualization of the IOMM framework. (a) The architecture of our proposed framework. (b) Ablation study demonstrating the effectiveness of architectural design choices, confirming that each component contributes positively to the final GenEval score. All variants utilize the same IOMM-XL architecture.
  • Figure 3: Analysis of different data paradigms. Fine-tuning performance comparison of models pre-trained on different data compositions (image-only, text-image pair) across distinct datasets.
  • Figure 4: Ablation studies of key components in IOMM. These experiments analyze the impact of our primary design choices: (a) the residual query adapter, (b) the mask ratio for sparse reconstruction, and (c) the data mixture ratio during fine-tuning.
  • Figure 5: Image editing ability with different pre-training methods.
  • ...and 2 more figures