The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training
Subramanyam Sahoo
TL;DR
This paper studies how reward design affects RLHF-based fine-tuning of LLMs on mathematical reasoning tasks. It proposes hard, continuous, and hybrid reward formulations, plus an adaptive scheduler that balances exploration and stability. On GSM8K with a LoRA-tuned Qwen3-4B, the hard reward achieves the best final accuracy (~40%), continuous rewards train more stably, and hybrid schedules yield intermediate gains, illustrating the trade-off between optimizing directly for correctness and optimizing a richer signal. The work emphasizes careful reward calibration and scheduling as key drivers of alignment and provides a scalable framework for future reward-model experiments.
Abstract
Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework to study hard, continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals, balancing exploration and stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches, offering insights for alignment via adaptive reward modeling.
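To make the three reward structures concrete, the following is a minimal Python sketch, not the paper's implementation. It makes several assumptions that the abstract does not state: the hard reward is an exact-match check on the final answer, the continuous reward is a weighted average of component scores already normalized to [0, 1], and the scheduler is a fixed linear interpolation that moves weight from the hard to the continuous signal. This is a simple stand-in for the adaptive scheduler, whose actual rule and direction are not given here. All function names, parameter names, and weights are hypothetical.

```python
from typing import Mapping


def hard_reward(predicted_answer: str, reference_answer: str) -> float:
    """Discrete signal: 1.0 on exact match after trimming whitespace, else 0.0."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def continuous_reward(components: Mapping[str, float],
                      weights: Mapping[str, float]) -> float:
    """Weighted average of component scores (assumed to lie in [0, 1]).

    The component names used below (correctness, perplexity, reasoning,
    consistency) mirror the abstract; how each score is computed is not
    specified there, so they are treated as given inputs.
    """
    total_weight = sum(weights.values())
    return sum(weights[name] * components[name] for name in weights) / total_weight


def hard_weight(step: int, total_steps: int,
                start: float = 1.0, end: float = 0.0) -> float:
    """Weight on the hard reward, linearly interpolated from start to end.

    Assumption: a fixed linear schedule replaces the adaptive rule.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * progress


def hybrid_reward(step: int, total_steps: int,
                  predicted_answer: str, reference_answer: str,
                  components: Mapping[str, float],
                  weights: Mapping[str, float]) -> float:
    """Convex combination of the hard and continuous rewards."""
    alpha = hard_weight(step, total_steps)
    return (alpha * hard_reward(predicted_answer, reference_answer)
            + (1.0 - alpha) * continuous_reward(components, weights))


if __name__ == "__main__":
    # Hypothetical scores and equal weights, for illustration only.
    scores = {"correctness": 1.0, "perplexity": 0.6,
              "reasoning": 0.7, "consistency": 0.8}
    equal = {name: 0.25 for name in scores}
    for step in (0, 50, 100):
        print(step, hybrid_reward(step, 100, "42", "42", scores, equal))
```

With `start=1.0` and `end=0.0`, training begins on the pure hard reward and ends on the pure continuous reward; swapping the two values reverses the direction of the transition.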
