Evaluating LLMs at Detecting Errors in LLM Responses

Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang

TL;DR

This paper addresses the lack of objective benchmarks for detecting errors in LLM outputs by introducing ReaLMistake, a binary error detection benchmark built from three tasks (MathGen, FgFactV, AnsCls) that elicit objectively assessable errors in four categories (Reasoning Correctness, Instruction-Following, Context-Faithfulness, Parameterized Knowledge). It provides 900 expert-annotated responses from GPT-4-0613 and Llama 2 70B and uses them to evaluate 12 LLM-based error detectors under zero-shot prompts. The study finds that even strong LLMs detect errors at very low recall, that their explanations are unreliable, and that common improvement techniques (self-consistency, majority voting, or extra evaluation steps) do not reliably boost performance, although prompt design significantly influences results. Overall, ReaLMistake offers a realistic, objective, and challenging testbed for developing more robust error detectors, with code and data openly available.

Abstract

With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection of LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research focuses on tasks of little practical value (e.g., word sorting) or limited error types (e.g., faithfulness in summarization). This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. ReaLMistake contains three challenging and meaningful tasks that introduce objectively assessable errors in four categories (reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge), eliciting naturally observed and diverse errors in responses of GPT-4 and Llama 2 70B annotated by experts. We use ReaLMistake to evaluate error detectors based on 12 LLMs. Our findings show: 1) Top LLMs like GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and all LLM-based error detectors perform much worse than humans. 2) Explanations by LLM-based error detectors lack reliability. 3) LLM-based error detection is sensitive to small changes in prompts but remains challenging to improve. 4) Popular approaches to improving LLMs, including self-consistency and majority vote, do not improve the error detection performance. Our benchmark and code are provided at https://github.com/psunlpgroup/ReaLMistake.
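
To make the majority-vote idea in finding 4 concrete, the following is a minimal sketch, not the authors' implementation. It assumes each detector run yields a boolean (True meaning "this response contains an error"); the function names and the tie-breaking rule are hypothetical choices made for illustration.

```python
from typing import Sequence


def majority_vote(predictions: Sequence[bool]) -> bool:
    """Aggregate several binary error predictions for one response.

    Returns True ("error") only when strictly more than half of the
    predictions are True. Resolving ties to False is an assumption.
    """
    return 2 * sum(predictions) > len(predictions)


def error_recall(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of truly erroneous responses that were flagged as errors."""
    flagged = [p for g, p in zip(gold, predicted) if g]
    if not flagged:
        return 0.0  # no erroneous responses in the gold labels
    return sum(flagged) / len(flagged)


# Hypothetical example: three sampled predictions for each of three responses.
samples = [
    [True, False, True],
    [False, False, True],
    [False, False, False],
]
gold_labels = [True, True, False]
voted = [majority_vote(s) for s in samples]
recall = error_recall(gold_labels, voted)
```

Voting can only flag an error when most individual runs already flag it, which is one way to see why aggregation alone need not raise recall.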

Paper Structure (61 sections, 11 figures, 10 tables)

This paper contains 61 sections, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Left: Tasks in existing LLM evaluation benchmarks are often subjective and not suitable for collecting errors made by LLMs for the purpose of evaluating binary error detection methods. Right: We introduce the ReaLMistake benchmark with realistic, objective, and diverse errors made by LLMs for evaluating error detection.
  • Figure 2: Examples of three tasks in ReaLMistake with four error categories. Each instance includes a binary error label, error categories, and annotators' explanations of the errors in a response from GPT-4-0613 or Llama 2 70B. The appendix provides full details.
  • Figure 3: Creation processes of three tasks in ReaLMistake. Details are in the appendix on dataset creation.
  • Figure 3: Math Word Problem Generation task consists of diverse requirements in 9 categories.
  • Figure 4: Error detection performance of 12 LLMs with zero-shot prompts on ReaLMistake. This table shows the average performance over the four prompts described in the section on bias in error detection. The "Random" baseline predicts each instance as an error with probability equal to the frequency of error labels in each dataset. Human performance is evaluated on 35 cases in each setting. Gray marks values worse than the random baseline.
  • ...and 6 more figures
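
The "Random" baseline described in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function names are hypothetical, and scoring with precision, recall, and F1 on the error class is an assumption about how detectors are compared.

```python
import random
from typing import List, Sequence, Tuple


def random_baseline(gold: Sequence[bool], rng: random.Random) -> List[bool]:
    """Predict "error" for each instance with probability equal to the
    fraction of error labels in the dataset (True means "error")."""
    error_rate = sum(gold) / len(gold)
    return [rng.random() < error_rate for _ in gold]


def precision_recall_f1(
    gold: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the error class."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical toy dataset: 6 erroneous responses out of 10.
gold_labels = [True] * 6 + [False] * 4
baseline_predictions = random_baseline(gold_labels, random.Random(0))
baseline_scores = precision_recall_f1(gold_labels, baseline_predictions)
```

Because the baseline flags errors at the dataset's own error rate, its expected precision and recall both equal that rate, so a detector scoring below it (the gray cells) is doing worse than label-frequency guessing.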