$\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking Reinforcement Learning Algorithms in Generative Language Model

Yufeng Zhang; Liyu Chen; Boyi Liu; Yingxiang Yang; Qiwen Cui; Yunzhe Tao; Hongxia Yang

TL;DR

This work presents the $(N,K)$-Puzzle as a scalable, cost-efficient testbed to benchmark reinforcement learning strategies for generative language models, enabling controlled variation of problem size and target value. It systematically evaluates RM-based RL (PPO with RM) against RM-free approaches (DPO, IPO) using a GPT-2 backbone, revealing that RM-based PPO can be undermined by RM hacking, while DPO/IPO exhibit stronger regularization but limited out-of-distribution generalization. The study provides detailed insights into the trade-offs of reward modeling, KL/top-$p$ regularization, and preference-based optimization, highlighting the puzzle’s potential as a standardized environment for comparing RL methods in LM training. Overall, $(N,K)$-Puzzle offers a practical, interpretable platform to diagnose, compare, and improve RL strategies for scalable language generation.

Abstract

Recent advances in reinforcement learning (RL) algorithms aim to enhance the performance of language models at scale. Yet there is a noticeable absence of a cost-effective, standardized testbed for evaluating and comparing these algorithms. To bridge this gap, we present a generalized version of the 24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a target value $K$ with $N$ integers. We evaluate the effectiveness of established RL algorithms such as Proximal Policy Optimization (PPO), alongside novel approaches like Identity Preference Optimization (IPO) and Direct Preference Optimization (DPO).
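To make the task concrete, here is a minimal sketch of a checker and a brute-force solver for one puzzle instance. It assumes the classic 24-Puzzle rules (each of the $N$ integers is used exactly once, combined with `+`, `-`, `*`, `/` and parentheses); the abstract does not spell out the exact rules or the reward levels used in the paper, so the function names `check` and `solve` and the binary correct/incorrect outcome are illustrative assumptions, not the authors' implementation.

```python
import ast
import operator
from fractions import Fraction
from itertools import combinations

# Assumed rule set: the four basic arithmetic operations.
AST_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
SYMBOLS = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}


def check(expression, numbers, target):
    """Return True if `expression` uses exactly `numbers` and evaluates to `target`."""
    used = []

    def ev(node):
        # Integer literals are the puzzle numbers; record each one we see.
        if isinstance(node, ast.Constant) and type(node.value) is int:
            used.append(node.value)
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in AST_OPS:
            return AST_OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported syntax")

    try:
        value = ev(ast.parse(expression, mode="eval").body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return value == target and sorted(used) == sorted(numbers)


def solve(numbers, target):
    """Brute-force search: return a solving expression string, or None."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick two remaining values, combine them, and recurse on the smaller list.
    for i, j in combinations(range(len(items)), 2):
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        for (x, ex), (y, ey) in ((items[i], items[j]), (items[j], items[i])):
            for sym, fn in SYMBOLS.items():
                if sym == "/" and y == 0:
                    continue  # skip division by zero
                combined = (fn(x, y), f"({ex} {sym} {ey})")
                found = _search(rest + [combined], target)
                if found is not None:
                    return found
    return None
```

For example, `check("(7 - 8 / 8) * 4", [4, 7, 8, 8], 24)` returns `True`, while an expression that reuses or omits a number is rejected. Exact rational arithmetic via `Fraction` avoids floating-point false negatives on expressions that pass through non-integer intermediate values.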

Paper Structure (19 sections, 6 equations, 4 figures, 3 tables)

Introduction
Background
Reward Model Training
Reinforcement Learning with an RM
Reinforcement Learning without an RM
Problem Setup: $\mathbf{(N, K)}$-Puzzle
Experiments
Experiment Setup
Reward Model
PPO
DPO and IPO
Conclusion
Ethical Statement
Limitations
Examples of Responses with Different Ground Truth Reward
...and 4 more sections

Figures (4)

Figure 1: Training of the model: The process of training consists of multiple parts. Firstly, we train the model with SFT to align the model with the $(N, K)$-Puzzle. Then, we perform RL training (PPO, DPO, IPO, and RM) based on the SFT model.
Figure 2: Performance of PPO with ground-truth reward and RM. While PPO with the ground-truth reward keeps boosting both the in-distribution and OOD accuracies, PPO with the RM starts to degrade after a short period of training.
Figure 3: Test accuracy for DPO and IPO for different values of regularization parameter $\beta$. We observe that IPO is more robust across different $\beta$. It is worth noting that both DPO and IPO fail to enhance generalization of the generative LMs.
Figure 4: Training dynamics of PPO with reward model without KL or top-$p$ regularization.
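
The role of the regularization parameter $\beta$ in Figure 3 can be illustrated with a small sketch of the per-pair DPO and IPO objectives. This follows the commonly used formulations of the two losses, not code from the paper; the function names and the scalar log-probability inputs are assumptions made for illustration, and the paper's exact parameterization of $\beta$ may differ.

```python
import math


def implicit_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Log-ratio margin between the preferred (w) and rejected (l) responses.

    Each argument is the summed log-probability of a full response under
    the trained policy (logp_*) or the frozen reference policy (ref_logp_*).
    """
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(margin, beta):
    """DPO: -log(sigmoid(beta * margin)), computed in a numerically stable way."""
    z = beta * margin
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def ipo_loss(margin, beta):
    """IPO: squared distance between the margin and the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2
```

Under these formulations the DPO loss keeps decreasing as the margin grows, whereas the IPO loss is minimized at the finite margin $1/(2\beta)$, which bounds how far the policy is pushed away from the reference.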
