Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order

Prakhar Gupta, Vaibhav Gupta

TL;DR

This work investigates whether a scalar hint toward a canonical solver order, applied only during RL post-training, can improve Sudoku solving when the model was fine-tuned on randomized solution sequences. A Transformer is fine-tuned on Sudoku data and then post-trained with GRPO using two rewards, cell accuracy and an ordering signal aligned with the solver order, combined via fixed mixtures and stabilized by bootstrapped scaling. Mixed rewards generally outperform cell-only optimization: the best mix (cell:order = 0.75:0.25) approaches the accuracy of a model fine-tuned on solver-order sequences, showing that coarse ordering signals can bias RL post-training toward solver-like trajectories without modifying data or architecture. This suggests a practical, modular knob for injecting structural priors into post-training, one that may transfer to structured tasks beyond Sudoku.

Abstract

Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during RL post-training, improves performance even when the model has been fine-tuned on randomized solution sequences. On Sudoku, we train a Transformer with standard fine-tuning on randomized solving orders, then post-train it with Group Relative Policy Optimization (GRPO) using two rewards: cell accuracy and an ordering reward that increases when the model's emission order aligns with the solver order. To compare signals cleanly, we combine them via fixed mixtures and use a simple bootstrapped scaling to equalize component magnitudes at initialization. Mixed rewards generally outperform cell-only optimization: the best mixture yields substantially higher test accuracy than the fine-tuned-only model trained on random-order sequences and approaches the accuracy of the fine-tuned-only model trained on solver-order sequences. These results suggest that coarse ordering signals can steer RL post-training toward solver-order trajectories without modifying supervised data or architecture.
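To make the reward construction concrete, the following is a minimal sketch of how two reward components might be rescaled at initialization and then mixed before computing group-relative advantages. It is an illustration under stated assumptions, not the paper's implementation: the function names, the position-wise definition of order alignment, the choice of mean absolute reward as the bootstrap statistic, and the convention that `alpha` weights the cell reward are all hypothetical.

```python
from statistics import mean, pstdev


def cell_accuracy(pred_digits, true_digits):
    """Fraction of cells whose predicted digit matches the solution."""
    return sum(p == t for p, t in zip(pred_digits, true_digits)) / len(true_digits)


def order_alignment(emitted_cells, solver_cells):
    """Assumed ordering reward: fraction of steps at which the model fills
    the same cell index that the canonical solver fills at that step."""
    return sum(e == s for e, s in zip(emitted_cells, solver_cells)) / len(solver_cells)


def bootstrap_scales(init_cell_rewards, init_order_rewards, eps=1e-8):
    """Assumed bootstrapping: estimate one scale per component from rollouts
    of the initial (fine-tuned) policy so both have mean magnitude ~1."""
    s_cell = 1.0 / (mean(abs(r) for r in init_cell_rewards) + eps)
    s_order = 1.0 / (mean(abs(r) for r in init_order_rewards) + eps)
    return s_cell, s_order


def mixed_reward(r_cell, r_order, alpha, scales):
    """Fixed mixture; alpha is assumed to be the weight on the cell reward."""
    s_cell, s_order = scales
    return alpha * s_cell * r_cell + (1.0 - alpha) * s_order * r_order


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# Hypothetical rollout statistics from the initial policy.
scales = bootstrap_scales([0.8, 0.4], [0.1, 0.3])
# Mixed rewards for one group of two completions at cell:order = 0.75:0.25.
group = [mixed_reward(c, o, 0.75, scales) for c, o in [(0.8, 0.1), (0.4, 0.3)]]
advantages = group_relative_advantages(group)
```

Under these assumptions, the scales are computed once and held fixed, so the mixture weights express the intended relative emphasis rather than being dominated by whichever component happens to have the larger raw magnitude.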

Paper Structure

This paper contains 19 sections, 4 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Reward mixtures and performance. (a) Effect of reward mixing on Sudoku cell accuracy. Each point is GRPO post-trained on the fine-tuned (random order) model at the indicated $\alpha$. Horizontal lines show fine-tuned baselines. (b) Test cell accuracy under different cell-to-order reward mixtures for GRPO post-training (starting from the random-order fine-tuned model).