MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning

Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu

TL;DR

MoDoMoDo presents a post-training framework that optimizes multimodal RLVR by learning a data-mixture surrogate over multiple vision-language datasets. It introduces a model-based approach that uses a quadratic surrogate, $\widehat{L}_{test}(w)=a+b^T w+w^T C w$, to predict test performance and select an optimal mixture $w^*\in\Delta_m$ for RLVR fine-tuning, reducing pilot runs. Empirically, multi-domain data mixtures improve generalization on both in-domain and out-of-domain VL benchmarks, with the model-based strategy achieving higher reliability and lower variance than seed or heuristic methods. The work advances cross-domain multimodal reasoning by providing a tunable, data-driven mechanism to allocate training signals across diverse verifiable-VL tasks, enabling broader reasoning capabilities with reduced compute.
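As an illustration of the quadratic surrogate described above, the following minimal sketch fits $\widehat{L}_{test}(w)=a+b^T w+w^T C w$ to a handful of pilot runs by ordinary least squares and then picks the best mixture from random candidates on the simplex $\Delta_m$. It is not the authors' implementation. The function names (`quad_features`, `fit_surrogate`, `predict`, `best_mixture`), the least-squares fit, the Dirichlet candidate search, and the convention that a higher predicted score is better are all assumptions made for this example.

```python
import numpy as np

def quad_features(W):
    """Map mixtures W of shape (n, m) to features [1, w_i, w_i * w_j for i <= j]."""
    W = np.atleast_2d(W)
    n, m = W.shape
    iu, ju = np.triu_indices(m)  # index pairs (i, j) with i <= j
    return np.hstack([np.ones((n, 1)), W, W[:, iu] * W[:, ju]])

def fit_surrogate(W, scores):
    """Least-squares fit of the coefficients (a, b, upper triangle of C).

    Assumption: `scores` holds one measured test score per pilot mixture.
    On the simplex the features are linearly dependent, so lstsq returns
    the minimum-norm solution; predictions on the simplex are unaffected.
    """
    theta, *_ = np.linalg.lstsq(quad_features(W), scores, rcond=None)
    return theta

def predict(theta, W):
    """Predicted test score for each mixture in W."""
    return quad_features(W) @ theta

def best_mixture(theta, m, n_candidates=20000, seed=0):
    """Return the candidate on the simplex with the highest predicted score.

    Assumption: random Dirichlet candidates stand in for a proper
    constrained optimizer, and higher scores are better.
    """
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(m), size=n_candidates)
    return candidates[np.argmax(predict(theta, candidates))]
```

A typical use would be to run a small number of pilot RLVR trainings with different mixtures, pass the mixtures and their measured scores to `fit_surrogate`, and then call `best_mixture` to choose the weights $w^*$ for the full run. If the quantity being predicted is a loss rather than a score, `np.argmax` would become `np.argmin`.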

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for post-training large language models (LLMs), achieving state-of-the-art performance on tasks with structured, verifiable answers. Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but is complicated by the broader, heterogeneous nature of vision-language tasks that demand nuanced visual, logical, and spatial capabilities. As such, training MLLMs using RLVR on multiple datasets could be beneficial but creates challenges with conflicting objectives from interaction among diverse datasets, highlighting the need for optimal dataset mixture strategies to improve generalization and reasoning. We introduce a systematic post-training framework for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation and benchmark implementation. Specifically, (1) We developed a multimodal RLVR framework for multi-dataset post-training by curating a dataset that contains different verifiable vision-language problems and enabling multi-domain online RL learning with different verifiable rewards; (2) We proposed a data mixture strategy that learns to predict the RL fine-tuning outcome from the data mixture distribution, and consequently optimizes the best mixture. Comprehensive experiments showcase that multi-domain RLVR training, when combined with mixture prediction strategies, can significantly boost MLLM general reasoning capacities. Our best mixture improves the post-trained model's accuracy on out-of-distribution benchmarks by an average of 5.24% compared to the same model post-trained with uniform data mixture, and by a total of 20.74% compared to the pre-finetuning baseline.

Paper Structure

This paper contains 38 sections, 2 equations, 8 figures, 2 tables, 4 algorithms.

Figures (8)

  • Figure 1: MoDoMoDo is a framework that combines Multi-Domain Data Mixtures with Multimodal LLM Reinforcement Learning, enabling generalizable performance gain across diverse VL tasks. Models trained with our estimated optimal mixtures can outperform those trained with naive mixtures on in-domain and out-of-domain benchmarks.
  • Figure 2: Demonstration of a General Question-Answer Pair With and Without Reasoning.
  • Figure 3: Model Performance before / after GRPO training on All data mixture.
  • Figure 4: Model Performance Comparison after GRPO training using All data mixture and $3$ Single data mixtures that have in-distribution test set.
  • Figure 5: Model Performance Comparison after GRPO training using All data mixture and $5$ Exclude-One data mixtures.
  • ...and 3 more figures