PairUni: Pairwise Training for Unified Multimodal Language Models

Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang

TL;DR

PairUni addresses the difficulty of balancing understanding and generation in unified vision-language models during reinforcement learning by introducing understanding-generation (UG) pairs and a pair-aware optimization method. On the data side, it uses GPT-based augmentation to build aligned UG quadruples from single instances and retrieval-based pairs across instances, exposing cross-task correspondences. On the optimization side, Pair-GPRO weights policy updates by pair similarity to reinforce well-aligned supervision while mitigating task interference. The authors report gains on MMMU, MMStar, MME, GenEval, and WISE at the 1B and 7B scales, plus transfer to Lumina-DiMOO. They also provide PairUG, a dataset of 16K UG pairs, and demonstrate generality across architectures, underlining the practical value of data alignment and similarity-weighted credit assignment for unified multimodal training.
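
The retrieval-based pairing mentioned above can be pictured with a small sketch. It assumes that every sample already has an embedding vector and that "semantically related" is measured by cosine similarity; both are assumptions made for illustration, and the function names `cosine` and `retrieve_pairs` are hypothetical rather than taken from the PairUni code. The clustering step shown in the paper's pipeline figure is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_pairs(gen_embs, und_embs):
    """For each generation sample, pick the most similar understanding sample.

    Returns a list of (generation_index, understanding_index, similarity).
    Hypothetical helper: the embedding model and the similarity measure
    are assumptions, not details confirmed by the source.
    """
    pairs = []
    for g_idx, g in enumerate(gen_embs):
        scored = [(u_idx, cosine(g, u)) for u_idx, u in enumerate(und_embs)]
        best_idx, best_sim = max(scored, key=lambda item: item[1])
        pairs.append((g_idx, best_idx, best_sim))
    return pairs


if __name__ == "__main__":
    generation = [[1.0, 0.0], [0.0, 1.0]]
    understanding = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
    for g_idx, u_idx, sim in retrieve_pairs(generation, understanding):
        print(f"generation {g_idx} <-> understanding {u_idx} (similarity {sim:.2f})")
```

The similarity returned for each pair is the kind of score that a pair-aware optimizer could later reuse as a weight.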

Abstract

Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code is available at https://github.com/Haochen-Wang409/PairUni.
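
To make the idea of a similarity score that modulates the advantage more concrete, here is a minimal sketch. It uses the standard group-relative advantage from Group Relative Policy Optimization (each reward standardized by its group's mean and standard deviation) and then assumes that the modulation is a plain multiplication by a similarity in [0, 1]. The exact functional form used by Pair-GPRO is not stated in the abstract, so the multiplication and the name `pair_weighted_advantages` are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def pair_weighted_advantages(rewards, similarity, eps=1e-6):
    """Scale group-relative advantages by the pair's similarity score.

    Assumption: 'modulate' is modelled as multiplication by a score in [0, 1],
    so a poorly aligned pair (score near 0) contributes almost no update.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity is expected to lie in [0, 1]")
    return [similarity * a for a in group_relative_advantages(rewards, eps)]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]  # rewards for four sampled responses
    print(pair_weighted_advantages(rewards, similarity=1.0))
    print(pair_weighted_advantages(rewards, similarity=0.5))
```

Under this assumption, halving the similarity halves every advantage in the group, which is one simple way that well-aligned pairs could drive larger policy updates than weakly related ones.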

Paper Structure

This paper contains 21 sections, 7 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Performance Conflict Mechanism Analysis: Median gradient cosine similarity scores between understanding and generation components, alongside benchmark performance on two understanding benchmarks (MMMU [yue2024mmmu], MMStar [chen2024we]) and one image generation benchmark (GenEval [ghosh2023geneval]). The analysis covers six data combination scenarios: PairUG, retrieval-based pairs, unpaired data with low similarity scores, generation-only data, understanding-only data, and random pairs.
  • Figure 2: Data Pairing Algorithm
  • Figure 3: Data Pairing Pipeline. Left: examples of aligned quadruples from generation and understanding tasks. Right: pairing strategy using retrieval and clustering.
  • Figure 4: Framework of PairUni: A dual-component design integrating a data processing pipeline and the GRPO reinforcement learning algorithm.
  • Figure 5: Case Study: Images generated by Janus-Pro-7B and PairUni
  • ...and 7 more figures
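
The median gradient cosine similarity named in the Figure 1 caption can be illustrated with a short sketch. It assumes that, at each training step, the gradient of the understanding loss and the gradient of the generation loss have been flattened into plain vectors of equal length; how the paper actually collects these gradients is not described in this summary, and the function names are hypothetical.

```python
import math
import statistics


def grad_cosine(g_und, g_gen):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_und, g_gen))
    norm = math.sqrt(sum(a * a for a in g_und)) * math.sqrt(sum(b * b for b in g_gen))
    return dot / norm if norm > 0.0 else 0.0


def median_grad_cosine(und_grads, gen_grads):
    """Median over steps of the per-step understanding/generation cosine.

    Values near +1 suggest the two tasks push parameters in similar
    directions; negative values suggest conflicting updates.
    """
    return statistics.median(grad_cosine(u, g) for u, g in zip(und_grads, gen_grads))


if __name__ == "__main__":
    # Toy gradients for three steps: aligned, opposed, orthogonal.
    und = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    gen = [[1.0, 0.0], [-1.0, -1.0], [1.0, 0.0]]
    print(median_grad_cosine(und, gen))
```

A median is less sensitive than a mean to the occasional step with an extreme cosine value, which is one plausible reason to report it.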