Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Peng Sun; Jun Xie; Tao Lin

Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available at https://github.com/LINs-lab/IOMM.
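The two-stage data schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary layout of a sample, and the `pair_fraction` parameter are all hypothetical, and the abstract does not state the actual mixture ratio. The sketch only shows the idea that stage 1 draws image-only samples while stage 2 mixes image-only samples with text-image pairs, and it computes the share of the reported GPU hours spent in stage 1.

```python
import random


def make_batch(stage, images, pairs, batch_size, pair_fraction=0.5, rng=None):
    """Draw one batch of training samples for the given stage.

    Stage 1 uses image-only data. Stage 2 mixes image-only data with
    text-image pairs. `pair_fraction` is a hypothetical knob for
    illustration; it is not a value taken from the paper.
    """
    rng = rng or random.Random(0)
    batch = []
    for _ in range(batch_size):
        if stage == 2 and rng.random() < pair_fraction:
            text, image = rng.choice(pairs)
            batch.append({"image": image, "text": text})
        else:
            # Image-only sample: no caption is attached.
            batch.append({"image": rng.choice(images), "text": None})
    return batch


def pretrain_share(pretrain_hours, total_hours):
    """Fraction of the total compute budget spent in stage 1."""
    return pretrain_hours / total_hours


if __name__ == "__main__":
    images = ["img_a", "img_b", "img_c"]
    pairs = [("a red cube", "img_d"), ("two dogs", "img_e")]
    print(make_batch(1, images, pairs, batch_size=4))
    print(make_batch(2, images, pairs, batch_size=4))
    # Hours reported in the abstract: about 1000 of about 1050.
    print(f"{pretrain_share(1000, 1050):.1%}")
```

With the figures quoted in the abstract, roughly 95% of the reported budget falls in the image-only stage, which is why removing paired data from that stage matters most for data cost.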

Paper Structure (41 sections, 7 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Text-to-image diffusion models.
Unified understanding and generation models.
Masked signal modeling.
Methodology
Preliminaries on Diffusion Models
Image-Only Pre-training via Self-Conditioning
Forming the self-conditioning signal.
Residual Query Adapter
Masked Image Modeling
Experiment
Experimental Setting
Datasets.
Neural network architectures.
...and 26 more sections

Figures (7)

Figure 1: An overview and validation of our proposed training paradigm. (a) Visual results of our IOMM-XL, demonstrating high-quality, multi-resolution image synthesis. Corresponding prompts are provided in the appendix. (b) An illustration of the six training recipes we investigate. (c) Quantitative results of six training recipes on the GenEval benchmark.
Figure 2: Visualization of the IOMM framework. (a) The architecture of our proposed framework. (b) Ablation study demonstrating the effectiveness of architectural design choices, confirming that each component contributes positively to the final GenEval score. All variants utilize the same IOMM-XL architecture.
Figure 3: Analysis of different data paradigms. Fine-tuning performance comparison of models pre-trained on different data compositions (image-only, text-image pair) across distinct datasets.
Figure 4: Ablation studies of key components in IOMM. These experiments analyze the impact of our primary design choices: (a) the residual query adapter, (b) the mask ratio for sparse reconstruction, and (c) the data mixture ratio during fine-tuning.
Figure 5: Image editing ability with different pre-training methods.
...and 2 more figures
