PairUni: Pairwise Training for Unified Multimodal Language Models
Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang
TL;DR
PairUni addresses the difficulty of balancing understanding and generation when unified vision-language models are trained with reinforcement learning. It combines understanding-generation (UG) pairs with a pair-aware optimization. On the data side, it augments single-task data with GPT-generated annotations, building aligned UG quadruples from the same instance and retrieval-based pairs across instances to expose cross-task correspondences. On the optimization side, Pair-GPRO weights policy updates by pair similarity, reinforcing well-aligned supervision while mitigating task interference. The authors report gains on MMMU, MMStar, MME, GenEval, and WISE at the 1B and 7B scales, plus transfer to Lumina-DiMOO. They also provide PairUG, a dataset of 16K UG pairs, and show that the approach generalizes across architectures, which underlines the practical value of data alignment and similarity-weighted credit assignment for unified multimodal training.
Abstract
Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code is available at https://github.com/Haochen-Wang409/PairUni.
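The two mechanisms described in the abstract, retrieving a related understanding sample for each generation sample and modulating the advantage by a pair similarity score, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity over embeddings, the clamping of the score to [0, 1], and the multiplicative form of the weighting are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_pair(gen_embedding, und_embeddings):
    """Return (index, similarity) of the understanding sample closest to a generation sample."""
    sims = [cosine(gen_embedding, e) for e in und_embeddings]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pair_weighted_advantages(rewards, similarity):
    """Scale group-relative advantages by a pair similarity score.

    Assumption: the score is clamped to [0, 1] and applied multiplicatively.
    """
    weight = max(0.0, min(1.0, similarity))
    return [weight * a for a in group_relative_advantages(rewards)]


# Hypothetical toy data: one generation embedding, two understanding embeddings.
gen = [1.0, 0.0]
und = [[0.0, 1.0], [1.0, 0.1]]
idx, sim = retrieve_pair(gen, und)

# Rewards for a group of four rollouts sampled for this pair.
weighted = pair_weighted_advantages([1.0, 0.0, 0.5, 0.5], sim)
```

Under this sketch, a retrieved pair with low similarity contributes proportionally smaller policy updates, which is one plausible way to realize the stated goal of strengthening learning from well-aligned examples while reducing task interference.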
