Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem

Yuhong Sun; Zhangyue Yin; Qipeng Guo; Jiawen Wu; Xipeng Qiu; Hui Zhao

TL;DR

Unanswerable math word problems (MWPs) are a reliable and effective way to assess hallucination, and both in-context learning and reinforcement learning with human feedback (RLHF) training significantly enhance a model's ability to avoid hallucination.

Abstract

Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks. However, in ambiguous contexts they are susceptible to producing unreliable conjectures, a phenomenon called hallucination. This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on unanswerable math word problems (MWPs). To support this approach, we develop a dataset called Unanswerable Math Word Problem (UMWP), which comprises 5200 questions across five categories. We also develop an evaluation methodology that combines text similarity and mathematical expression detection to determine whether an LLM considers a question unanswerable. Extensive experiments on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, demonstrate that in-context learning and reinforcement learning with human feedback (RLHF) training significantly enhance a model's ability to avoid hallucination. We show that utilizing MWPs is a reliable and effective approach to assess hallucination. Our code and data are available at https://github.com/Yuki-Asuuna/UMWP.
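The evaluation idea mentioned in the abstract, combining text similarity with mathematical expression detection, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the reference phrases, the 0.8 threshold, the regular expression, and the function names are all hypothetical, and `difflib.SequenceMatcher` stands in for whatever similarity measure the paper actually uses. The sketch also assumes that an answer given in terms of an unknown variable (for example "x + 5") signals that the model treats the question as unanswerable.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference phrases; the paper's actual reference set is not given here.
REFUSAL_PHRASES = [
    "the question cannot be answered",
    "there is not enough information to answer",
    "it is impossible to determine the answer",
]

# Hypothetical pattern: a single-letter variable joined to a number by an
# arithmetic operator, such as "x + 5" or "3 * y".
VARIABLE_EXPR = re.compile(
    r"\b(?:\d+\s*[-+*/]\s*[a-z]|[a-z]\s*[-+*/]\s*\d+)\b"
)


def max_similarity(response: str, phrases: list[str]) -> float:
    """Highest character-level similarity between any sentence and any phrase."""
    sentences = [s.strip() for s in re.split(r"[.!?\n]+", response.lower())]
    sentences = [s for s in sentences if s]
    return max(
        (SequenceMatcher(None, s, p).ratio() for s in sentences for p in phrases),
        default=0.0,
    )


def judged_unanswerable(response: str, threshold: float = 0.8) -> bool:
    """True if the response resembles a refusal or contains a variable expression."""
    similar = max_similarity(response, REFUSAL_PHRASES) >= threshold
    has_variable = VARIABLE_EXPR.search(response.lower()) is not None
    return similar or has_variable
```

A response is flagged when either signal fires, so a model that answers "x + 5 apples" is counted the same way as one that says outright that information is missing. A real evaluation would likely need a semantic similarity model and a more careful expression parser than this regular expression.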

Paper Structure (28 sections, 1 equation, 8 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Math Word Problem Benchmark
Mathematical Ability of LLM
Hallucination Benchmark
Dataset Construction
Unanswerable Question
Answerable Question
Evaluation Method
Experiment
Setting
Human Benchmark
Set U Construction
Experiment Results Analysis
Model Size
...and 13 more sections

Figures (8)

Figure 1: An example of hallucination towards a Math Word Problem (MWP).
Figure 2: An example of extracting variable expression from raw LLM output.
Figure 3: Experiment results from InstructGPT, Claude, and LLaMA series using three different input forms (Direct, Instruction, and ICL).
Figure 4: F1 score of LLMs in different series and human in the instruction input form.
Figure 5: Accuracy of the InstructGPT series in responding to answerable questions in the instruction input form.
...and 3 more figures
