RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior

Junyao Yang, Jianwei Wang, Huiping Zhuang, Cen Chen, Ziqian Zeng

TL;DR

The paper tackles the difficulty of merging long chain-of-thought (CoT) reasoning models with domain-specific models without eroding reasoning or output quality. It introduces RCP-Merging, a framework that treats reasoning capability as a prior and uses a Reasoning Preservation Indicator alongside Domain Knowledge Sensitivity to selectively merge domain weights while preserving core reasoning through a Bayesian prior (via a diagonal Fisher Information Matrix). The approach yields state-of-the-art or near-state-of-the-art results on BioMedicine and Finance benchmarks, maintains low gibberish output, and demonstrates cross-architecture robustness with emergent long-CoT behavior in domain problems. These findings suggest a practical pathway to versatile, dual-capability LLMs that balance specialized knowledge with complex multi-step reasoning, with broad implications for scalable deployment across domains.

Abstract

Large Language Models (LLMs) with long chain-of-thought (CoT) capability, termed Reasoning Models, demonstrate superior problem-solving abilities on intricate tasks through multi-step long CoT reasoning. Model merging is a highly resource-efficient way to create a dual-capability model that has both long CoT capability and domain-specific knowledge without substantial computational and data costs. However, merging domain-specific LLMs with long CoT ones remains challenging, since current merging methods suffer from degraded reasoning capability and can even produce gibberish output or output collapse. To overcome this, we introduce RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior, a novel merging framework designed to integrate domain-specific LLMs with long CoT capability while maintaining model performance in the original domain. Treating the reasoning model weights as a foundational prior, our method uses a reasoning capability indicator to preserve the weights that are core to long CoT capability while selectively merging essential domain-specific weights. We conducted extensive experiments on Qwen2.5-7B, Llama3.1-8B, and Qwen2.5-1.5B models in the BioMedicine and Finance domains. Our results show that RCP-Merging successfully merges a reasoning model with domain-specific ones, improving domain task performance by 9.5% and 9.2% over state-of-the-art methods, without significantly harming the original long CoT reasoning capability.

Paper Structure

This paper contains 34 sections, 19 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Performance comparison of RCP-Merging and other methods in merging Qwen2.5-7B, Meditron3-Qwen2.5-7B, and DeepSeek-R1-Distill-Qwen-7B on eight datasets in Math, Code, BioMedicine, and Knowledge areas.
  • Figure 2: RCP-Merging consists of three stages. (1) Domain Knowledge Sensitivity. This step quantifies each weight's importance for a specific domain by measuring the change in model loss when that weight is removed. (2) Reasoning Preservation Indicator. To protect the model's core reasoning capabilities, this stage applies a preservation term to weights that are crucial for reasoning. (3) Reasoning-preserved Merging. The final stage balances domain sensitivity and the reasoning preserving matrix, merging only the weights that enhance domain knowledge without harming reasoning capabilities.
  • Figure 3: Gibberish rate comparison for merging Qwen2.5-7B (Base), Meditron3-Qwen2.5-7B (BioMedicine), and DeepSeek-R1-Distill-Qwen-7B (Reasoning) on all datasets, where a lower rate indicates higher-quality content.
  • Figure 4: Hyperparameter Analysis. Experiments are conducted when merging Qwen2.5-7B (Base), Meditron3-Qwen2.5-7B (BioMedicine) and DeepSeek-R1-Distill-Qwen-7B (Reasoning) on BioMedicine datasets in Figure \ref{fig:hyper_med} and Reasoning datasets in Figure \ref{fig:hyper_reason}. Merged model performance is evaluated under different reasoning-preserving coefficients $\lambda$.
  • Figure 5: Hyperparameter Analysis. Experiments are conducted when merging Qwen2.5-7B (Base), Meditron3-Qwen2.5-7B (BioMedicine) and DeepSeek-R1-Distill-Qwen-7B (Reasoning) on BioMedicine datasets in Figure \ref{fig:hyper_med} and Reasoning datasets in Figure \ref{fig:hyper_reason}. Merged model performance is evaluated under different reasoning-preserving coefficients $\lambda$.
  • ...and 1 more figures
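
The three stages named in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the exact formulas are not given in this summary, so the sketch assumes (a) a first-order estimate |g · τ| for domain sensitivity, (b) an EWC-style penalty (λ/2) · F · τ² built from a diagonal Fisher estimate as the reasoning preservation term, and (c) a simple per-weight comparison of the two as the merge rule. All function names and the toy numbers are hypothetical.

```python
from typing import List

Vector = List[float]


def diagonal_fisher(per_sample_grads: List[Vector]) -> Vector:
    """Diagonal Fisher estimate: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def domain_sensitivity(domain_grad: Vector, task_vector: Vector) -> Vector:
    """Assumed stage 1: first-order loss change if a domain delta is removed."""
    return [abs(g * t) for g, t in zip(domain_grad, task_vector)]


def reasoning_penalty(fisher: Vector, task_vector: Vector, lam: float) -> Vector:
    """Assumed stage 2: quadratic penalty (lam / 2) * F_i * tau_i^2."""
    return [0.5 * lam * f * t * t for f, t in zip(fisher, task_vector)]


def rcp_style_merge(
    reasoning_w: Vector,
    domain_w: Vector,
    base_w: Vector,
    domain_grad: Vector,
    reasoning_grads: List[Vector],
    lam: float = 1.0,
) -> Vector:
    """Assumed stage 3: add a domain delta only where sensitivity beats the penalty."""
    tau = [d - b for d, b in zip(domain_w, base_w)]  # domain task vector
    sens = domain_sensitivity(domain_grad, tau)
    pen = reasoning_penalty(diagonal_fisher(reasoning_grads), tau, lam)
    return [r + t if s > p else r for r, t, s, p in zip(reasoning_w, tau, sens, pen)]


# Toy example with two weights (made-up numbers).
merged = rcp_style_merge(
    reasoning_w=[1.0, 2.0],
    domain_w=[0.5, 0.5],
    base_w=[0.0, 0.0],
    domain_grad=[2.0, 0.1],                   # weight 0 matters for the domain
    reasoning_grads=[[0.1, 2.0], [0.1, 2.0]],  # weight 1 matters for reasoning
    lam=1.0,
)
print(merged)  # [1.5, 2.0]
```

In this toy case the first weight takes the domain update because its sensitivity (1.0) exceeds its penalty (0.00125), while the second weight keeps its reasoning value because its penalty (0.5) exceeds its sensitivity (0.05). Setting `lam=0.0` removes the penalty, so every weight with nonzero sensitivity takes the domain update, which mirrors the role the caption of Figure 4 assigns to the reasoning-preserving coefficient λ.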