Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?

Chuxuan Hu; Yuxuan Zhu; Antony Kellermann; Caleb Biddulph; Suppakit Waiwitlikhit; Jason Benn; Daniel Kang

Abstract

Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, because prior work evaluates RPT models on data from the same domains used for post-training. To understand the generalizability of RPT, we conduct two studies with a specific focus on Reinforcement Learning with Verifiable Rewards (RLVR). (1) Observational: we compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, covering domains both seen and unseen in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion: although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.
