RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

Yang Liu, Jiaqi Li, Zilong Zheng

TL;DR

RuleReasoner is an effective method for rule-based reasoning that combines a wide collection of curated tasks with a novel domain-aware dynamic sampling approach for RL. The sampling facilitates domain balance and active learning schedules, obviating static, human-engineered mix-training.

Abstract

Rule-based reasoning is acknowledged as one of the fundamental problems of reasoning. While recent studies show that large reasoning models (LRMs) have remarkable reasoning capabilities enhanced by reinforcement learning (RL), real applications still face severe challenges due to variations in rule formats, types, and complexity. To mitigate this issue, we introduce RuleReasoner, an effective method for rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach in RL. Specifically, RuleReasoner resamples each training batch by updating the domain weights based on historical rewards. This facilitates domain balance and active learning schedules for RL, obviating static, human-engineered mix-training. Evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1% on eight ID tasks and $\Delta$10.4% on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior methods.
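
The abstract says only that each training batch is resampled after updating domain weights from historical rewards; it does not give the update rule. The following is a minimal sketch of one way such a scheme could look, not the authors' implementation. It assumes rewards lie in [0, 1], tracks them per domain with an exponential moving average, and turns the shortfall (1 - average reward) into sampling weights with a softmax. The class name `DomainSampler`, its parameters (`ema_decay`, `temperature`), and the domain names are all hypothetical.

```python
import math
import random


class DomainSampler:
    """Hypothetical domain-aware sampler (illustrative, not from the paper)."""

    def __init__(self, domains, ema_decay=0.9, temperature=0.5, seed=0):
        self.domains = list(domains)
        self.ema_decay = ema_decay      # assumed smoothing factor
        self.temperature = temperature  # assumed softmax temperature
        self.avg_reward = {d: 0.0 for d in self.domains}
        self.rng = random.Random(seed)

    def update(self, domain, reward):
        # Fold a new reward (assumed in [0, 1]) into the domain's running average.
        a = self.ema_decay
        self.avg_reward[domain] = a * self.avg_reward[domain] + (1 - a) * reward

    def weights(self):
        # Domains with lower historical reward receive larger sampling weight.
        scores = [(1.0 - self.avg_reward[d]) / self.temperature for d in self.domains]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return {d: e / total for d, e in zip(self.domains, exps)}

    def sample_batch(self, pools, batch_size):
        # Draw a domain per slot according to the current weights,
        # then draw one example uniformly from that domain's pool.
        w = self.weights()
        picks = self.rng.choices(
            self.domains, weights=[w[d] for d in self.domains], k=batch_size
        )
        return [(d, self.rng.choice(pools[d])) for d in picks]


sampler = DomainSampler(["logic", "arithmetic"])
for _ in range(20):
    sampler.update("logic", 1.0)  # pretend the "logic" domain is already solved
batch = sampler.sample_batch({"logic": ["q1", "q2"], "arithmetic": ["q3", "q4"]}, 8)
```

In this sketch, a domain whose rewards stay high is sampled less often, so training effort shifts toward domains that are still being learned without a hand-tuned static mixture.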

Paper Structure

This paper contains 43 sections, 3 equations, 11 figures, 15 tables, 1 algorithm.

Figures (11)

  • Figure 1: Out-of-distribution performance comparison between RuleReasoner (8B and 4B) and other frontier reasoning models on challenging rule-based reasoning benchmarks.
  • Figure 2: RuleReasoner training recipe.
  • Figure 3: Demonstration overview of RuleCollection-32K.
  • Figure 4: Learning dynamics by domains. "Reward" (blue) represents the training reward obtained from tasks and "Pass@1" (olive green) denotes validation pass@1 performance. We employ exponential moving average smoothing to display the curves "Reward", "Pass@1", and "Domain Weights" (red) clearly.
  • Figure 5: Impact on incremental task mixing recipes.
  • ...and 6 more figures