Zero Reinforcement Learning Towards General Domains

Yuyuan Zeng, Yufei Huang, Can Xu, Qingfeng Sun, Jianfeng Yan, Guanghui Xu, Tao Yang, Fengzong Lian

TL;DR

This work extends zero reinforcement learning to general domains by combining verifiable rewards with a generative reward model in a unified General Zero-RL framework. It uses multi-task RL to transfer reasoning from verifiable STEM tasks to open-ended general tasks, and introduces a smooth length penalty to mitigate reward hacking and encourage genuine thinking. Experiments on Qwen3-8B-Base and Qwen3-14B-Base show superior performance on math and general reasoning benchmarks and competitive results on general tasks, with ablations confirming the importance of both multi-task training and the length penalty. The approach improves the practical reasoning capabilities of LLMs across diverse domains and shows that incorporating non-verifiable data into RL training benefits generalization.

Abstract

Zero Reinforcement Learning (Zero-RL) has proven to be an effective approach for enhancing the reasoning capabilities of large language models (LLMs) by directly applying reinforcement learning with verifiable rewards on pretrained models, without the need for a supervised fine-tuning phase. However, current research on zero-RL primarily focuses on domains with easily verifiable reward signals, such as mathematics, programming, and other reasoning tasks. The challenge of eliciting reasoning abilities in more diverse scenarios, where verification is not straightforward, remains underexplored. To address this gap, we propose a novel zero-RL paradigm designed to improve a model's reasoning ability across both verifiable and non-verifiable domains. By combining verifiable rewards with a generative reward model, we conduct multi-task zero-RL training across both domains, facilitating the transfer of reasoning capabilities between them. Furthermore, to mitigate reward hacking in the generative reward model, we design a smooth length penalty that encourages the generation of more comprehensive thinking tokens in general domains. Experimental results on Qwen3-8B-Base and Qwen3-14B-Base demonstrate that our approach achieves superior reasoning performance, not only on tasks requiring extensive reasoning but also on more general tasks.
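
To make the reward design described above more concrete, here is a minimal Python sketch of how a verifiable reward, a generative reward model score, and a smooth length penalty could be combined. It is an illustration under stated assumptions, not the authors' formulation: the function names, the exponential penalty shape, and the `sharpness` and `penalty_weight` parameters are all hypothetical, and lengths are assumed to be measured in characters. The only behavior taken from the source is that the penalty applies to general tasks when the answer is longer than the thinking content.

```python
import math


def smooth_length_penalty(think_len: int, answer_len: int, sharpness: float = 1.0) -> float:
    """Hypothetical smooth penalty in [0, 1).

    Returns 0 while the answer is no longer than the thinking content and
    rises continuously toward 1 as the answer outgrows the thinking content.
    The exponential shape is an assumption, not the paper's formula.
    """
    excess = max(0, answer_len - think_len)
    # Normalise by the thinking length (at least 1 to avoid division by zero).
    ratio = excess / max(think_len, 1)
    return 1.0 - math.exp(-sharpness * ratio)


def general_zero_rl_reward(
    task_type: str,
    base_score: float,
    think_len: int,
    answer_len: int,
    penalty_weight: float = 0.5,
) -> float:
    """Route the reward by task type (hypothetical interface).

    - "verifiable": base_score is a rule-based correctness score, used as is.
    - "general": base_score is a generative reward model score, reduced by
      the weighted length penalty.
    """
    if task_type == "verifiable":
        return base_score
    if task_type == "general":
        penalty = smooth_length_penalty(think_len, answer_len)
        return base_score - penalty_weight * penalty
    raise ValueError(f"unknown task_type: {task_type!r}")


if __name__ == "__main__":
    # A general-task response whose answer is four times longer than its thinking.
    print(general_zero_rl_reward("general", 0.8, think_len=100, answer_len=400))
    # A verifiable-task response is scored by correctness alone.
    print(general_zero_rl_reward("verifiable", 1.0, think_len=10, answer_len=500))
```

In this sketch the penalty changes gradually rather than switching on at a hard threshold, so a response that only slightly exceeds its thinking length loses little reward, while one that skips thinking and writes a long answer loses much more.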

Paper Structure

This paper contains 24 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of our unified General Zero-RL framework. The framework performs multi-task learning over both general and reasoning tasks. To mitigate reward hacking in generative reward models, a length penalty is applied when the output answer exceeds the length of the generated thinking tokens.
  • Figure 2: Evolution of think content length and answer length (in terms of characters) for reasoning and general tasks over the training course of General-Zero-Qwen3-14B.
  • Figure 3: AIME24 accuracy and response length of General-Zero-Qwen3-14B and General-Zero-Qwen3-8B over the course of zero reinforcement learning training.
  • Figure 4: Evolution of think content and answer lengths (in terms of characters) on general data throughout the training of General-Zero-Qwen3-8B, comparing with and without length penalty.
  • Figure 5: Evolution of think content length and answer length (in terms of characters) on general data during the training process of General-Zero-Qwen3-8B models when trained with general-only data and multi-task data.