To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

Haoqing Wang; Xiang Long; Ziheng Li; Yilong Xu; Tingguang Li; Yehui Tang

TL;DR

The paper tackles the challenge of building generalist, expert-level LLMs capable of solving tasks across multiple domains by comparing two RLVR strategies: mixed multi-task RLVR and separate-domain RLVR followed by model merging. Using Qwen3-4B-Base as the base model and Nemotron datasets across math, coding, science, and instruction following, it shows that mixed multi-task RLVR can achieve performance comparable to merging with only about a third of the GPU cost, while exhibiting cross-domain synergy and minimal interference. Through analyses of weight-space geometry, policy neighborhoods, and verification horizons, the authors reveal that multi-task training enables neighborhood policy transfer and emergent capabilities, whereas merging methods tend to preserve single-task skills and yield complementary gains. These findings provide practical guidance for scalable post-training strategies to equip LLMs with broad, reliable reasoning across diverse domains, highlighting the potential and trade-offs of RLVR in cross-domain settings.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). RLVR can yield expert-level performance in specific domains, such as coding or math. When a general multi-domain expert-level model is required, the interaction of RLVR across different domains must be considered carefully. Current state-of-the-art models mainly employ two training paradigms for multi-domain RLVR: mixed multi-task RLVR, and separate RLVR followed by model merging. However, most prior work does not provide a detailed comparison and analysis of these paradigms. To this end, we choose several commonly used high-level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and that reasoning-intensive domains show mutually synergistic effects. Furthermore, we analyze the internal mechanisms behind these mutual gains from the perspectives of weight-space geometry, model prediction behavior, and information constraints. The project is named M2RL, short for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and its homepage is https://github.com/mosAI25/M2RL
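To make the "separate RLVR followed by model merging" paradigm concrete, here is a minimal toy sketch. It assumes task arithmetic (adding scaled weight deltas of each domain expert to the base model) as the merging rule; this page does not say which merging methods the paper evaluates, so the rule, the function names `task_vector` and `merge_task_arithmetic`, and the tiny weight dictionaries are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def task_vector(base: Weights, expert: Weights) -> Weights:
    """Weight shift of one expert: expert weights minus base weights."""
    return {
        name: [e - b for e, b in zip(expert[name], base_vals)]
        for name, base_vals in base.items()
    }


def merge_task_arithmetic(
    base: Weights, experts: List[Weights], scale: float = 1.0
) -> Weights:
    """Add the scaled sum of all expert task vectors to the base weights.

    Assumption: plain task arithmetic; other merging rules exist.
    """
    merged = {name: list(vals) for name, vals in base.items()}
    for expert in experts:
        delta = task_vector(base, expert)
        for name in merged:
            merged[name] = [
                m + scale * d for m, d in zip(merged[name], delta[name])
            ]
    return merged


# Toy data: one "layer" with two parameters and two single-domain experts.
base = {"w": [1.0, 2.0]}
math_expert = {"w": [2.0, 2.0]}   # shifted only in the first coordinate
code_expert = {"w": [1.0, 4.0]}   # shifted only in the second coordinate

merged = merge_task_arithmetic(base, [math_expert, code_expert])
print(merged)
```

In this toy case the two experts move disjoint coordinates, so the merged weights keep both shifts without conflict. Real experts overlap in the parameters they change, which is why the geometry of the weight shifts matters for merging.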

Paper Structure (20 sections, 1 equation, 8 figures, 8 tables)

Introduction
Related Works
Reinforcement Learning with Verifiable Rewards
Model Merging
Experiments and Analysis
Preliminary
Experimental Design and Results
Dataset blend.
Training.
Model Merging
Evaluation Results
Explore Weight Shift
Explore Policy Neighborhoods
Do Multi-Task Learners and Merged Models Acquire the Same Skills as Single-Task Models?
Locus of Error, Verification Horizon, and Multi-Task Synergy
...and 5 more sections

Figures (8)

Figure 1: The two training paradigms for multi-domain RLVR: mixed multi-task training and separate training followed by model merging.
Figure 2: The accuracy change trajectory of different benchmarks during math, coding, science and instruction following RLVR process.
Figure 3: The cross-domain cosine similarity of weight shift vectors in the overlapping regions. We report the average scores on attention weights (Q, K, V and O) and FFN weights (FFN-up, FFN-down and FFN-gate) respectively.
Figure 4: Cross-comparison of KL divergence. The y-axis represents the domain of the expert model, while the x-axis indicates the domain from which trajectories were sampled to compute the KL divergence. Each cell value represents the KL divergence. $\Delta \mathrm{Perf}$ represents the performance change of the multi-domain model relative to the domain expert on the sampled domains.
Figure 5: Accuracy gain consistency with union of single-task models on 5 benchmarks.
...and 3 more figures
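
The quantities named in the captions of Figures 3 and 4 can be sketched in a few lines. The example below is an illustrative assumption, not the paper's code: it computes the cosine similarity between two weight shift vectors (fine-tuned weights minus base weights) and the KL divergence between two discrete next-token distributions. How the paper aggregates over layers, tokens, and trajectories, and which direction of the KL divergence it uses, is not stated on this page; the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def weight_shift(base: Sequence[float], tuned: Sequence[float]) -> List[float]:
    """Weight shift vector: fine-tuned weights minus base weights."""
    return [t - b for b, t in zip(base, tuned)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0.

    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy weights: two experts trained from the same base.
base = [0.0, 0.0, 0.0]
math_shift = weight_shift(base, [1.0, 1.0, 0.0])
code_shift = weight_shift(base, [1.0, 0.0, 1.0])
print(cosine_similarity(math_shift, code_shift))

# Toy next-token distributions of two policies over a 2-token vocabulary.
print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

A cosine similarity near zero means the two weight shifts are close to orthogonal, while a value near one means they point in nearly the same direction. A small KL divergence between two policies on the same trajectories means they make similar predictions there.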
