$\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking Reinforcement Learning Algorithms in Generative Language Model
Yufeng Zhang, Liyu Chen, Boyi Liu, Yingxiang Yang, Qiwen Cui, Yunzhe Tao, Hongxia Yang
TL;DR
This work presents the $(N,K)$-Puzzle as a scalable, cost-efficient testbed to benchmark reinforcement learning strategies for generative language models, enabling controlled variation of problem size and target value. It systematically evaluates RM-based RL (PPO with RM) against RM-free approaches (DPO, IPO) using a GPT-2 backbone, revealing that RM-based PPO can be undermined by RM hacking, while DPO/IPO exhibit stronger regularization but limited out-of-distribution generalization. The study provides detailed insights into the trade-offs of reward modeling, KL/top-$p$ regularization, and preference-based optimization, highlighting the puzzle’s potential as a standardized environment for comparing RL methods in LM training. Overall, $(N,K)$-Puzzle offers a practical, interpretable platform to diagnose, compare, and improve RL strategies for scalable language generation.
Abstract
Recent advances in reinforcement learning (RL) algorithms aim to enhance the performance of language models at scale. Yet, there is a noticeable absence of a cost-effective and standardized testbed tailored to evaluating and comparing these algorithms. To bridge this gap, we present a generalized version of the 24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a target value $K$ with $N$ integers. We evaluate the effectiveness of established RL algorithms such as Proximal Policy Optimization (PPO), alongside novel approaches like Identity Preference Optimization (IPO) and Direct Preference Optimization (DPO).
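To make the task concrete, the following is a minimal sketch of how a candidate answer to an $(N,K)$-Puzzle instance could be checked. The abstract does not spell out the exact rules, so this sketch assumes the conventions of the classic 24 game: each of the $N$ integers is used exactly once, and only `+`, `-`, `*`, `/` and parentheses are allowed. The function name `check_solution` and its signature are hypothetical, chosen for illustration rather than taken from the authors' code.

```python
import ast
from collections import Counter
from fractions import Fraction


def _evaluate(node, used):
    """Recursively evaluate an arithmetic AST node with exact rational arithmetic.

    Every integer literal encountered is appended to `used` so the caller
    can verify that the expression uses exactly the given numbers.
    """
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp):
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        if isinstance(node.op, ast.Add):
            return left + right
        if isinstance(node.op, ast.Sub):
            return left - right
        if isinstance(node.op, ast.Mult):
            return left * right
        if isinstance(node.op, ast.Div):
            return left / right  # raises ZeroDivisionError when right == 0
    # Anything else (names, calls, unary minus, powers, ...) is rejected.
    raise ValueError("unsupported expression element")


def check_solution(numbers, target, expression):
    """Return True if `expression` uses each integer in `numbers` exactly once
    and evaluates to `target` (assumed rules, see the text above)."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target


if __name__ == "__main__":
    # N = 4, K = 24 recovers the classic 24 game.
    print(check_solution([1, 2, 3, 4], 24, "(1+2+3)*4"))
    print(check_solution([3, 3, 8, 8], 24, "8/(3-8/3)"))
    print(check_solution([1, 2, 3, 4], 24, "4*6"))
```

Exact `Fraction` arithmetic is used here so that answers relying on intermediate non-integer values, such as `8/(3-8/3)`, are not rejected because of floating-point rounding. A rule-based check of this kind would give an exact correctness signal for a puzzle instance, although whether the authors' evaluation follows the same rules is an assumption of this sketch.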
