IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation

Bosi Wen, Yilin Niu, Cunxiang Wang, Pei Ke, Xiaoying Ling, Ying Zhang, Aohan Zeng, Hongning Wang, Minlie Huang

TL;DR

IF-Critic is a fine-grained LLM critic that evaluates instruction following by decomposing each instruction into a constraint checklist and producing per-constraint critiques in a single inference pass. A multi-stage critique filtering pipeline and constraint-level preference optimization are used to train a 14B-parameter model that outperforms strong LLM-as-a-Judge baselines and supplies scalable reward signals for instruction-following optimization at lower computational overhead. The framework combines checklist-driven critique generation, cross-model and rule-based verification, majority-consensus judgment, and MBP-style explanation selection to improve reliability. Empirical results show substantial gains in both instruction-following evaluation and optimization across multiple benchmarks (EvalCritic, CFBench, TRACE, Multi-IF, SysBench); ablations highlight the importance of each component, and human annotators judge the explanations to be high-quality. The work shows that constraint-aware critiques can guide LLM alignment through DPO and GRPO, offering a practical, scalable alternative to relying on large proprietary judges such as GPT-4o for evaluation signals.

Abstract

Instruction following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction following still possess many deficiencies, such as substantial costs and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic that can provide efficient and reliable assessments of constraint following in the instructions. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments demonstrate that the evaluation performance of IF-CRITIC can beat strong LLM-as-a-Judge baselines, including DeepSeek-R1 and o4-mini. With the scalable reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization under lower computational overhead compared to strong LLM critic baselines.
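
To make the checklist-based evaluation concrete, the following is a minimal Python sketch of how per-constraint judgments might be represented, combined by majority vote across several sampled critiques, and reduced to a single score. It is an illustration under stated assumptions, not the authors' implementation: the names `ConstraintJudgment`, `majority_verdicts`, and `fraction_satisfied` are hypothetical, the tie-breaking rule is an arbitrary conservative choice, and the "fraction of satisfied constraints" score is only one plausible way to turn constraint-level verdicts into a reward signal.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintJudgment:
    """One entry of a critique: a checklist item plus the critic's verdict.

    Hypothetical structure; the paper's actual critique format may differ.
    """
    constraint: str    # one checklist item, e.g. "Answer in under 50 words"
    explanation: str   # the critic's free-text rationale
    satisfied: bool    # the critic's verdict for this constraint


def majority_verdicts(critiques: list[list[ConstraintJudgment]]) -> dict[str, bool]:
    """Majority vote per constraint across several sampled critiques.

    Assumption: ties count as "not satisfied", a conservative choice.
    """
    votes: dict[str, Counter] = {}
    for critique in critiques:
        for judgment in critique:
            votes.setdefault(judgment.constraint, Counter())[judgment.satisfied] += 1
    return {c: tally[True] > tally[False] for c, tally in votes.items()}


def fraction_satisfied(verdicts: dict[str, bool]) -> float:
    """Hypothetical scalar score: share of checklist constraints judged satisfied."""
    if not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)


if __name__ == "__main__":
    # Three sampled critiques of the same response against a two-item checklist.
    sampled = [
        [ConstraintJudgment("under 50 words", "42 words", True),
         ConstraintJudgment("formal tone", "uses slang", False)],
        [ConstraintJudgment("under 50 words", "within limit", True),
         ConstraintJudgment("formal tone", "acceptable", True)],
        [ConstraintJudgment("under 50 words", "miscounted", False),
         ConstraintJudgment("formal tone", "contractions", False)],
    ]
    verdicts = majority_verdicts(sampled)
    print(verdicts, fraction_satisfied(verdicts))
```

Keeping verdicts at the constraint level, rather than collapsing them immediately into one response-level label, is what allows both consensus filtering and fine-grained reward computation to operate on the same data.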

Paper Structure

This paper contains 34 sections, 6 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: A usage example of IF-Critic: Given an instruction and a response, a checklist generator first decomposes the instruction to generate a constraint checklist. Then, IF-Critic can provide fine-grained evaluations for the response with respect to its following of all included constraints.
  • Figure 2: The pipeline of IF-Critic development. The left section illustrates the process of critique training data construction, while the right section presents the process of training IF-Critic.
  • Figure 3: Explanation quality evaluation results. The percentages indicate the preference between IF-Critic and other evaluation models via human annotation.
  • Figure 4: Reward curves during GRPO training when IF-Critic and QwQ-32B are employed as the LLM critics. For Llama-3.1-8B-Instruct, training with QwQ-32B results in model collapse after 300 steps, with the model tending to generate extensive repetitive and meaningless content. For efficiency reasons, we terminate further training and calculate the average per-step training time for all critics and reward models based on the first 300 steps.
  • Figure 5: Reward curves during GRPO training when Skywork-Reward-V2-Llama-3.1-8B-40M is employed as the reward model.