IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Bosi Wen; Yilin Niu; Cunxiang Wang; Xiaoying Ling; Ying Zhang; Pei Ke; Hongning Wang; Minlie Huang

TL;DR

This paper proposes IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. Its listwise evaluation paradigm assesses how well judge models rank multiple responses, a capability that is essential for guiding model alignment.

Abstract

Instruction-following is a foundational capability of large language models (LLMs), and improving it hinges on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored because existing meta-evaluation benchmarks have several deficiencies, such as insufficient data coverage and oversimplified pairwise evaluation paradigms that are misaligned with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance than existing benchmarks. Our code and data are available at https://github.com/thu-coai/IF-RewardBench.
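
The preference-graph idea in the abstract can be illustrated with a minimal sketch. It assumes each response carries a hypothetical numeric instruction-following quality label, derives all pairwise preferences from those labels, and scores a judge's ranking by the fraction of preferences it orders correctly. The function names, the quality labels, and the pairwise-agreement score are illustrative assumptions, not the benchmark's actual data format or metric.

```python
from itertools import combinations


def build_preference_graph(quality):
    """Return directed edges (better, worse) implied by quality labels.

    `quality` maps a response id to a hypothetical instruction-following
    score (higher is better). Tied responses produce no edge.
    """
    edges = set()
    for a, b in combinations(quality, 2):
        if quality[a] > quality[b]:
            edges.add((a, b))
        elif quality[b] > quality[a]:
            edges.add((b, a))
    return edges


def pairwise_agreement(edges, judge_ranking):
    """Fraction of preference edges that the judge's ranking orders correctly.

    `judge_ranking` lists response ids from best to worst. This is an
    illustrative score, not necessarily the metric used in the paper.
    """
    if not edges:
        raise ValueError("preference graph has no edges to evaluate")
    position = {resp: i for i, resp in enumerate(judge_ranking)}
    correct = sum(position[better] < position[worse] for better, worse in edges)
    return correct / len(edges)


# Hypothetical quality labels for four responses to one instruction.
quality = {"r1": 3, "r2": 1, "r3": 2, "r4": 1}
graph = build_preference_graph(quality)

# A hypothetical judge ranking that wrongly places r2 above r3.
judge_ranking = ["r1", "r2", "r3", "r4"]
score = pairwise_agreement(graph, judge_ranking)
print(f"{len(graph)} preference edges, agreement = {score:.2f}")
```

In this toy case the tie between r2 and r4 contributes no edge, so the graph has five preferences and the judge gets four of them right. A single pairwise comparison would test only one of those edges, which is the limitation the listwise paradigm is meant to address.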

Paper Structure (52 sections, 3 equations, 7 figures, 20 tables)

Introduction
Related Work
Instruction-Following.
Evaluation of Judge Models.
IF-RewardBench
Task Definition
Evaluation Tasks of Judge Models
Core Capabilities of Judge Models
Dataset Construction
Data Source
Instruction Collection.
Instruction Filtration.
Instruction Decomposition.
Response Generation.
Preference Graph Curation
...and 37 more sections

Figures (7)

Figure 1: An example from IF-RewardBench, containing a user instruction, a constraint checklist, and multiple responses with various instruction-following quality that form a preference graph.
Figure 2: Overall framework of IF-RewardBench. Left: Collect instructions and responses from diverse sources. Center: Curate preference graphs via multi-stage annotation and verification. Right: Assess various judge models based on different evaluation paradigms.
Figure 3: The distribution of constraint categories and constraint composition types for instructions in IF-RewardBench.
Figure 4: The performance of judge models in verification across different constraint categories and composition types. The constraint composition types are shown in bold.
Figure 5: Various factors that influence the performance of judge models in ranking. "CA" and "OA" denote Constraint Assessment and Overall Assessment, respectively.
...and 2 more figures
