IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Bosi Wen, Yilin Niu, Cunxiang Wang, Xiaoying Ling, Ying Zhang, Pei Ke, Hongning Wang, Minlie Huang

TL;DR

IF-RewardBench is a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. It supports a listwise evaluation paradigm that assesses how well judge models rank multiple responses, a capability that is essential for guiding model alignment.

Abstract

Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our code and data are available at https://github.com/thu-coai/IF-RewardBench.
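
To make the preference-graph idea concrete, the following is a minimal sketch rather than the benchmark's actual construction or metric, neither of which is specified in this summary. It assumes, purely for illustration, that each response's instruction-following quality is the number of checklist constraints it satisfies, and it scores a judge's ranking by the fraction of preference edges the ranking respects. The function names `build_preference_graph` and `ranking_agreement` and the sample data are hypothetical.

```python
from itertools import combinations


def build_preference_graph(quality):
    """Return directed edges (better, worse) implied by per-response quality.

    `quality` maps a response id to a numeric score (assumed here to be the
    count of satisfied constraints). Tied responses produce no edge.
    """
    edges = set()
    for a, b in combinations(quality, 2):
        if quality[a] > quality[b]:
            edges.add((a, b))
        elif quality[b] > quality[a]:
            edges.add((b, a))
    return edges


def ranking_agreement(edges, judge_ranking):
    """Fraction of preference edges whose order the judge's ranking respects.

    `judge_ranking` lists response ids from best to worst. This is an
    illustrative pairwise-agreement measure, not the paper's metric.
    """
    if not edges:
        return 1.0
    position = {resp: i for i, resp in enumerate(judge_ranking)}
    correct = sum(position[better] < position[worse] for better, worse in edges)
    return correct / len(edges)


# Hypothetical example: four responses, scored by constraints satisfied.
quality = {"A": 3, "B": 2, "C": 2, "D": 0}
edges = build_preference_graph(quality)

# A judge that places B above A contradicts one of the five edges.
print(ranking_agreement(edges, ["A", "C", "B", "D"]))
print(ranking_agreement(edges, ["B", "A", "C", "D"]))
```

Because B and C tie, the graph has five edges instead of six, so a judge is not penalized for ordering tied responses either way. A pairwise benchmark would sample only one of these edges per instruction, whereas the listwise setting checks the judge's full ranking against all of them at once.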

Paper Structure

This paper contains 52 sections, 3 equations, 7 figures, and 20 tables.

Figures (7)

  • Figure 1: An example from IF-RewardBench, containing a user instruction, a constraint checklist, and multiple responses of varying instruction-following quality that form a preference graph.
  • Figure 2: Overall framework of IF-RewardBench. Left: Collect instructions and responses from diverse sources. Center: Curate preference graphs via multi-stage annotation and verification. Right: Assess various judge models based on different evaluation paradigms.
  • Figure 3: The distribution of constraint categories and constraint composition types for instructions in IF-RewardBench.
  • Figure 4: The performance of judge models in verification across different constraint categories and composition types. The constraint composition types are shown in bold.
  • Figure 5: Various factors that influence the performance of judge models in ranking. "CA" and "OA" denote Constraint Assessment and Overall Assessment, respectively.
  • ...and 2 more figures