Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order
Prakhar Gupta, Vaibhav Gupta
TL;DR
This work investigates whether a scalar hint toward a canonical solver order, applied only during RL post-training, can improve Sudoku solving when the model has been fine-tuned on randomized solution sequences. A Transformer is fine-tuned on Sudoku data and then post-trained with GRPO using two rewards, cell accuracy and an ordering signal aligned with the solver order, which are combined via fixed mixtures and stabilized by bootstrapped scaling. Mixed rewards generally outperform cell-only optimization. The best mix (cell:order = 0.75:0.25) approaches the accuracy of a model fine-tuned on solver-order sequences, showing that coarse ordering signals can bias RL post-training toward solver-like trajectories without modifying data or architecture. This suggests a practical, modular knob for injecting structural priors into post-training, potentially transferring to other structured tasks beyond Sudoku.
Abstract
Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during RL post-training, improves performance even when the model was fine-tuned on randomized solution sequences. On Sudoku, we train a Transformer with standard fine-tuning on randomized solving orders, then post-train it with Group Relative Policy Optimization (GRPO) using two rewards: cell accuracy and an ordering reward that increases when the model's emission order aligns with the solver order. To compare signals cleanly, we combine them via fixed mixtures and use a simple bootstrapped scaling to equalize component magnitudes at initialization. Mixed rewards generally outperform cell-only optimization: the best mixture yields substantially higher test accuracy than the fine-tuned-only model trained on random-order sequences and approaches the accuracy of the fine-tuned-only model trained on solver-order sequences. These results suggest that coarse ordering signals can steer RL post-training toward solver-order trajectories without modifying supervised data or architecture.
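As a rough illustration of the reward construction described above, the following Python sketch mixes a cell-accuracy reward with an ordering reward, rescales both using statistics gathered at initialization, and computes group-relative advantages in the GRPO style. It is a sketch under stated assumptions, not the authors' implementation: the pairwise-concordance form of `order_alignment`, the choice of `1 / mean` as the bootstrapped scale, and all function names are hypothetical. Only the two reward components, the fixed mixture, and the 0.75:0.25 weighting come from the text.

```python
from statistics import mean, pstdev

def cell_accuracy(pred, target):
    """Fraction of cells whose predicted digit matches the solution."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)

def order_alignment(emitted, solver):
    """Hypothetical ordering reward: fraction of cell pairs that the model
    emits in the same relative order as the canonical solver."""
    rank = {cell: i for i, cell in enumerate(solver)}
    ranks = [rank[c] for c in emitted if c in rank]
    pairs = [(a, b) for i, a in enumerate(ranks) for b in ranks[i + 1:]]
    if not pairs:
        return 0.0
    return sum(a < b for a, b in pairs) / len(pairs)

def bootstrap_scales(cell_rewards, order_rewards, eps=1e-8):
    """Assumed bootstrapping rule: estimate one scale per component from
    rollouts of the initial policy so both components start near 1."""
    return 1.0 / (mean(cell_rewards) + eps), 1.0 / (mean(order_rewards) + eps)

def mixed_reward(cell_r, order_r, scales, w_cell=0.75, w_order=0.25):
    """Fixed-weight mixture of the rescaled reward components."""
    s_cell, s_order = scales
    return w_cell * s_cell * cell_r + w_order * s_order * order_r

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of
    completions sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch the scales are computed once from the fine-tuned model's rollouts and then frozen, so the mixture weights keep a consistent meaning throughout post-training. Setting `w_order=0.0` recovers cell-only optimization as a baseline.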
