The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training

Subramanyam Sahoo

TL;DR

This paper studies how reward design affects RLHF-style fine-tuning of LLMs on mathematical reasoning tasks. It formulates hard, continuous, and hybrid rewards, plus an adaptive scheduler intended to balance exploration and stability. On GSM8K with a LoRA-tuned Qwen3-4B, the hard reward reaches the best final accuracy (~40%), continuous rewards train more stably, and hybrid schedules yield intermediate gains, illustrating the trade-off between optimizing directly for correctness and optimizing a richer reward signal. The work emphasizes careful reward calibration and scheduling as key drivers of alignment and provides a scalable framework for future reward-model experiments.

Abstract

Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework to study hard, continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals, balancing exploration and stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches, offering insights for alignment via adaptive reward modeling.
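To make the hard/continuous/hybrid distinction concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. The function names, the exponential closeness score, and the fixed linear schedule are all hypothetical: the abstract describes an adaptive scheduler and continuous rewards built from correctness, perplexity, reasoning quality, and consistency, none of which are reproduced here. The direction of the transition (continuous early, hard late) is also an assumption.

```python
import math
from typing import Optional


def _parse_number(text: str) -> Optional[float]:
    """Parse a final answer as a float; return None if it is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def hard_reward(predicted: str, target: str) -> float:
    """Discrete signal: 1.0 for an exact numeric match, 0.0 otherwise."""
    p, t = _parse_number(predicted), _parse_number(target)
    return 1.0 if p is not None and t is not None and p == t else 0.0


def continuous_reward(predicted: str, target: str, scale: float = 1.0) -> float:
    """Smooth stand-in signal (assumption): decays with absolute error."""
    p, t = _parse_number(predicted), _parse_number(target)
    if p is None or t is None:
        return 0.0
    return math.exp(-abs(p - t) / scale)


def hybrid_weight(step: int, total_steps: int) -> float:
    """Fixed linear schedule (assumption): 0.0 at the start, 1.0 at the end."""
    return min(max(step / total_steps, 0.0), 1.0)


def hybrid_reward(predicted: str, target: str, step: int, total_steps: int) -> float:
    """Blend the two signals; the weight on the hard reward grows over training."""
    w = hybrid_weight(step, total_steps)
    return w * hard_reward(predicted, target) + (1.0 - w) * continuous_reward(predicted, target)


if __name__ == "__main__":
    # A near-miss answer earns partial credit early and none at the end.
    for step in (0, 50, 100):
        print(step, round(hybrid_reward("41", "42", step, 100), 4))
```

In this sketch a near-miss answer is rewarded early, when the continuous term dominates, and earns nothing once the schedule has shifted fully to the hard reward. An adaptive scheduler would replace `hybrid_weight` with a rule driven by training statistics rather than the step count alone.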

Paper Structure

This paper contains 34 sections, 17 equations, 3 figures, 1 table, 2 algorithms.

Figures (3)

  • Figure 1: Performance Heatmap
  • Figure 2: Reward Components Evolution
  • Figure 3: Training Dynamics Comprehensive