Checklists Are Better Than Reward Models For Aligning Language Models

Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, Tongshuang Wu

TL;DR

The paper tackles instruction-following alignment by moving beyond fixed reward criteria and introducing Reinforcement Learning from Checklist Feedback (RLCF), which uses instruction-derived checklists evaluated by AI judges and verifier programs to generate rewards. It builds WildChecklists (130k instructions) and demonstrates that RLCF consistently improves a strong instruction-following model across five benchmarks, outperforming instruction finetuning, reward-model baselines, and single-rubric judges. The results show checklist-based rewards provide stable, interpretable signals and can be applied off-policy to other model families, albeit with notable compute costs. The work suggests a promising direction for RL-based LM alignment that leverages dynamic, instruction-specific rubrics rather than static, global criteria.

Abstract

Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this, typically using fixed criteria such as "helpfulness" and "harmfulness". In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose "Reinforcement Learning from Checklist Feedback" (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item, using both AI judges and specialized verifier programs, then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction-following model (Qwen2.5-7B-Instruct) on five widely studied benchmarks. RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models' support of queries that express a multitude of needs.
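
The reward computation described in the abstract (score each checklist item, then combine the per-item scores) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ChecklistItem` class, the [0, 1] score scale, the per-item weights, and the rule of averaging the judge score with a verifier's pass/fail result are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence


@dataclass
class ChecklistItem:
    """One requirement extracted from an instruction (hypothetical structure)."""
    requirement: str
    weight: float  # assumed importance weight; must be non-negative
    # Optional program that checks the requirement exactly (assumed signature).
    verifier: Optional[Callable[[str], bool]] = None


def item_score(item: ChecklistItem, response: str,
               judge_samples: Sequence[float]) -> float:
    """Score one item in [0, 1] from sampled judge scores and an optional verifier."""
    # Average the sampled judge scores (raises StatisticsError if none are given).
    judge = mean(judge_samples)
    if item.verifier is None:
        return judge
    verified = 1.0 if item.verifier(response) else 0.0
    # Assumption: judge and verifier are combined by a simple average.
    return (judge + verified) / 2


def checklist_reward(items: Sequence[ChecklistItem], response: str,
                     judge_samples: Sequence[Sequence[float]]) -> float:
    """Weighted average of per-item scores, used as a scalar reward."""
    if len(items) != len(judge_samples):
        raise ValueError("need one list of judge samples per checklist item")
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("checklist weights must sum to a positive value")
    weighted = sum(
        item.weight * item_score(item, response, samples)
        for item, samples in zip(items, judge_samples)
    )
    return weighted / total_weight
```

For instance, a two-item checklist where the first item has weight 3 and judge samples `[0.5, 1.0]`, and the second has weight 1, a passing verifier, and a single judge sample of `1.0`, would give item scores of 0.75 and 1.0 and a reward of (3 × 0.75 + 1 × 1.0) / 4 = 0.8125 under these assumptions.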

Paper Structure

This paper contains 21 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: RL on Checklist Feedback consistently improves Qwen2.5-7B-Instruct, whereas every other source of automatic feedback gives mixed results.
  • Figure 2: We propose Reinforcement Learning from Checklist Feedback, where sampled responses are evaluated by a teacher model grounded on a fixed set of criteria. In our pipeline, given instructions, we first generate checklists synthetically from the instructions, grade each response on each checklist item, combine per-item scores into a single weighted checklist score, then use this score for RL.
  • Figure 3: Checklist feedback can be viewed as an extreme mixture-of-evaluators, where the space of (prompted) evaluators is unbounded and a unique subset of evaluators is chosen for each instruction.
  • Figure 4: RLCF samples 25 scores when grading each requirement. This is expensive. Fortunately, much of the efficacy is retained using just 5 samples (55% less clock time).
  • Figure 5: Impact of different filtering strategies on model performance on FollowBench and InFoBench. We compare filtering pairs based on overall checklist score differences versus filtering based on single-aspect score differences, at varying dataset sizes. There are only slight differences between these two filtering methods, until we start filtering out the vast majority of the data. This suggests that the reward signal, rather than the specific filtering algorithm, is likely responsible for this method's effectiveness.
  • ...and 2 more figures
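
The Figure 5 caption mentions filtering response pairs by the difference in their overall checklist scores. One way such a filter might look is sketched below. The pair construction (best-scored versus worst-scored response per instruction), the `keep_fraction` parameter, and all function names are assumptions for illustration; the caption does not specify these details.

```python
import math
from typing import Dict, List, Tuple

# (instruction, chosen response, rejected response, overall score gap)
Pair = Tuple[str, str, str, float]


def build_pairs(scored: Dict[str, List[Tuple[str, float]]]) -> List[Pair]:
    """Form one pair per instruction from (response, checklist score) lists.

    Assumption: the highest-scored response is 'chosen' and the
    lowest-scored response is 'rejected'.
    """
    pairs: List[Pair] = []
    for instruction, responses in scored.items():
        if len(responses) < 2:
            continue  # cannot form a pair from a single response
        ranked = sorted(responses, key=lambda r: r[1])
        rejected, low = ranked[0]
        chosen, high = ranked[-1]
        pairs.append((instruction, chosen, rejected, high - low))
    return pairs


def keep_largest_gaps(pairs: List[Pair], keep_fraction: float) -> List[Pair]:
    """Keep the fraction of pairs with the largest overall score gap."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(pairs, key=lambda p: p[3], reverse=True)
    n_keep = math.ceil(len(ranked) * keep_fraction)
    return ranked[:n_keep]
```

Lowering `keep_fraction` corresponds to the smaller dataset sizes compared in the figure; a single-aspect variant would rank pairs by the gap on one checklist item instead of the overall score.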