LLMCRIT: Teaching Large Language Models to Use Criteria

Weizhe Yuan, Pengfei Liu, Matthias Gallé

TL;DR

LLMCrit presents a model-in-the-loop framework that teaches large language models to generate task feedback guided by comprehensive, guideline-derived criteria. By semi-automatically deriving criteria from guidelines, constructing in-context demonstrations for each criterion, and evaluating feedback via a layered, multi-perspective approach, it demonstrates improvements in constructiveness and validity across three tasks: paper introduction writing, Python code writing, and Reddit post writing. The results reveal nuanced interactions: providing criteria generally helps, demonstrations can boost quality but may hinder contextualization if overly long, and adding both does not always outperform criteria alone. The study highlights practical strategies for scalable oversight and offers insights into granularity levels for criteria, with implications for broader, criterion-driven feedback systems. Overall, LLMCrit advances how LLMs can align feedback with human evaluative standards to improve writing and coding tasks in real-world settings.

Abstract

Humans follow criteria when they execute tasks, and these criteria are directly used to assess the quality of task completion. Therefore, having models learn to use criteria to provide feedback can help humans or models to perform tasks better. However, existing research in this field tends to consider only a limited set of criteria or quality assessment aspects. To fill this gap, we propose a general framework that enables large language models (LLMs) to use comprehensive criteria for a task in delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. We choose three tasks from real-world scenarios to operationalize this idea: paper introduction writing, Python code writing, and Reddit post writing, and evaluate our feedback generation framework using different LLMs. The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide valuable insights on how to teach LLMs to use criteria more effectively.

Paper Structure (43 sections, 3 figures, 26 tables)

Figures (3)

  • Figure 1: Illustration of teaching LLMs to use criteria.
  • Figure 2: Our LLMCrit framework for teaching LLMs to use criteria. By applying a model-in-the-loop approach, we semi-automatically derive criteria and construct in-context demonstrations for each criterion. "Sec" stands for "section", "Crit" stands for "criterion", and "IC" stands for "in-context". Steps 1, 2, and 3 need to be completed only once, and the resulting criteria and demonstrations can be reused by different LLMs in Step 4 (shaded).
  • Figure 3: Hierarchical tree structure of the writing task guideline $G$.