Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?

Chuxuan Hu, Yuxuan Zhu, Antony Kellermann, Caleb Biddulph, Suppakit Waiwitlikhit, Jason Benn, Daniel Kang

Abstract

Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for post-training. To understand the generalizability of RPT, we conduct two studies with a specific focus on Reinforcement Learning with Verifiable Rewards (RLVR). (1) Observational: we compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including domains both seen and unseen in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion: although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.

Paper Structure

This paper contains 26 sections, 5 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: The method (a) and key findings (b) of our work. Through a unified multi-domain evaluation framework combining observational and interventional studies, we find that RPT exhibits limited generalizability across domains.
  • Figure 2: RPT models on single domains show significant pass@1 improvements over base models and higher odds ratios on in-domain tasks, but not on out-of-domain tasks. No single-domain model achieves statistically significant gains in out-of-domain tasks.
  • Figure 3: Multi-domain evaluation results of existing RPT models. We highlight in-domain results with frames. RPT shows mutual generalizability between math and code, one-way transfer from knowledge-intensive reasoning to math and code, but no generalization from math or code to knowledge-intensive reasoning.
  • Figure 4: Multi-domain evaluation results of single domain RPT models. We highlight RPT domains with frames. RPT demonstrates generalizability from math to code and from knowledge-intensive reasoning to math, but shows no generalizability from math or code to knowledge-intensive reasoning.
  • Figure 5: In-domain and out-of-domain improvements during RPT training on the math domain. The gap between in-domain and out-of-domain improvements grows as training progresses.
  • ...and 8 more figures
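
The caption of Figure 2 refers to pass@1 and odds ratios without defining them here. As a rough illustration only, the sketch below computes pass@1 as the fraction of tasks solved on a single attempt, and an odds ratio comparing an RPT model against its base model from correct/total counts. The function names, the example counts, and the 0.5 continuity correction are assumptions made for this sketch; the paper's own definitions may differ.

```python
from typing import Sequence


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks solved on a single attempt (one bool per task)."""
    if not outcomes:
        raise ValueError("need at least one graded attempt")
    return sum(outcomes) / len(outcomes)


def odds_ratio(
    rpt_correct: int,
    rpt_total: int,
    base_correct: int,
    base_total: int,
    correction: float = 0.5,
) -> float:
    """Odds of a correct answer for the RPT model divided by the base model's odds.

    `correction` is added to every cell of the 2x2 table so that a cell
    with zero counts does not cause a division by zero. This is an
    assumption of the sketch, not something taken from the paper.
    With correction=0.0 and an empty cell, ZeroDivisionError is raised.
    """
    a = rpt_correct + correction                   # RPT model, correct
    b = (rpt_total - rpt_correct) + correction     # RPT model, incorrect
    c = base_correct + correction                  # base model, correct
    d = (base_total - base_correct) + correction   # base model, incorrect
    return (a * d) / (b * c)


# Hypothetical counts, not results from the paper.
in_domain_or = odds_ratio(rpt_correct=30, rpt_total=40, base_correct=20, base_total=40)
```

An odds ratio above 1 means the RPT model has higher odds of solving a task than the base model, and a value near 1 means no measurable difference. Judging statistical significance, as the caption does, would additionally require a confidence interval or a test, which this sketch does not compute.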