LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, Nanyun Peng

TL;DR

The paper proposes the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which improves LLMs' ability to follow instructions with multiple constraints and raises Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval, even with weak feedback.

Abstract

Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
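The decompose, critique, and refine loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four callables (`generate`, `decompose`, `critique`, `refine`), the `max_iters` default, and the toy stand-ins below are all hypothetical names introduced here, and a real system would back each of them with a model call.

```python
from typing import Callable, List

# Hypothetical interfaces; each would be a model call in a real system.
Generate = Callable[[str], str]
Decompose = Callable[[str], List[str]]
Critique = Callable[[str, str, List[str]], List[str]]  # returns unmet constraints
Refine = Callable[[str, str, List[str]], str]


def decrim(instruction: str, generate: Generate, decompose: Decompose,
           critique: Critique, refine: Refine, max_iters: int = 3) -> str:
    """Generate a response, then critique and refine it until the critic
    reports no unmet constraints or max_iters refinements were attempted."""
    response = generate(instruction)
    constraints = decompose(instruction)
    for _ in range(max_iters):
        unmet = critique(instruction, response, constraints)
        if not unmet:  # critic is satisfied: stop early
            break
        response = refine(instruction, response, unmet)
    return response


# Toy stand-ins (assumptions for illustration only, no model involved).
def toy_generate(instruction: str) -> str:
    return "Mondays, am I right? #mood"


def toy_decompose(instruction: str) -> List[str]:
    return ["funny tone", "no hashtag"]


def toy_critique(instruction: str, response: str,
                 constraints: List[str]) -> List[str]:
    # Only the "no hashtag" constraint is actually checked in this toy critic.
    return [c for c in constraints if c == "no hashtag" and "#" in response]


def toy_refine(instruction: str, response: str, unmet: List[str]) -> str:
    # Drop every whitespace-separated token that starts with '#'.
    return " ".join(w for w in response.split() if not w.startswith("#"))


final = decrim("Write a funny social media post with no hashtag.",
               toy_generate, toy_decompose, toy_critique, toy_refine)
print(final)
```

Having the critic return the list of unmet constraints, rather than a single pass/fail flag, is one way to capture the "when and where" part of the feedback: an empty list means no refinement is needed, and a non-empty list tells the refiner what to fix.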

Paper Structure (59 sections, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Example of a user instruction on which all subject LLMs failed. Responses from four LLMs are shown. All responses incorrectly include hashtags, despite a constraint explicitly prohibiting them. Constraints in the instruction are highlighted in blue, and errors in LLM responses are highlighted in red.
  • Figure 1: RealInstruct Benchmark Workflow. The original real-user instruction is input into the Subject LLM, which generates a response. Using the original instruction, the decomposed constraints, and the generated response, model-based evaluation assesses the response against each constraint one at a time, and then aggregates the results into an instruction-level metric.
  • Figure 2: The DeCRIM pipeline. Initially, the LLM generates a response to a user request. The Decomposer breaks down the request into granular constraints. A Critic model then gives feedback on whether the response meets all constraints. If it does, the response is output; if not, the feedback is used by the LLM to refine the response. This Critique-Refine cycle repeats until all constraints are satisfied or the maximum number of iterations is reached.
  • Figure 3: Two-step Decompose-then-Generate (DtG) prompt: Inspired by the two-step Rephrase and Respond (RaR; Deng et al., 2023), DtG first instructs the LLM to decompose multi-constrained instructions into an enumerated list of constraints. Then, DtG uses this decomposition as if it were the model's own "reasoning and planning" (leveraging the model's user and assistant tokens) to generate the final response. Like RaR, this process can be done in one or two steps, with the two-step method being more effective.
  • Figure 10: Sample Elements from RealInstruct Dataset - Part 1
  • ...and 1 more figure
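The RealInstruct workflow caption above describes judging a response against each constraint one at a time and then aggregating into an instruction-level metric. A minimal sketch of that aggregation follows, assuming an instruction counts as followed only when every constraint is judged satisfied (consistent with the abstract's "fails to meet at least one constraint" framing, but an assumption here). The `Sample` type, the `Judge` signature, and the keyword-based `toy_judge` are hypothetical; the paper uses model-based evaluation for the per-constraint verdict.

```python
from typing import Callable, List, NamedTuple


class Sample(NamedTuple):
    instruction: str
    constraints: List[str]
    response: str


# Hypothetical judge interface: (instruction, response, constraint) -> satisfied?
Judge = Callable[[str, str, str], bool]


def constraint_verdicts(sample: Sample, judge: Judge) -> List[bool]:
    """Judge the response against each constraint, one at a time."""
    return [judge(sample.instruction, sample.response, c)
            for c in sample.constraints]


def instruction_level_accuracy(samples: List[Sample], judge: Judge) -> float:
    """Fraction of samples whose response satisfies all of its constraints."""
    if not samples:
        return 0.0
    passed = sum(all(constraint_verdicts(s, judge)) for s in samples)
    return passed / len(samples)


def toy_judge(instruction: str, response: str, constraint: str) -> bool:
    # Keyword rules standing in for a model-based evaluator (illustration only).
    if constraint == "no hashtag":
        return "#" not in response
    if constraint == "under 10 words":
        return len(response.split()) < 10
    return True


prompt = "Write a post with no hashtag, under 10 words."
constraints = ["no hashtag", "under 10 words"]
samples = [
    Sample(prompt, constraints, "Coffee first, emails later."),
    Sample(prompt, constraints, "Coffee first, emails later. #mood"),
]
accuracy = instruction_level_accuracy(samples, toy_judge)
print(accuracy)
```

Keeping the per-constraint verdicts separate from the aggregate makes it possible to report both which constraint failed and whether the instruction as a whole was followed.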