Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection

Moxin Li, Wenjie Wang, Fuli Feng, Fengbin Zhu, Qifan Wang, Tat-Seng Chua

TL;DR

This work introduces a two-step framework that first instructs the LLM to reflect on and justify each candidate answer, then aggregates the justifications to evaluate the target answer comprehensively. The framework can be integrated with existing approaches to improve self-detection.

Abstract

Self-detection for Large Language Models (LLMs) seeks to evaluate the trustworthiness of an LLM's output by leveraging the model's own capabilities, thereby alleviating the issue of output hallucination. However, existing self-detection approaches only retrospectively evaluate answers generated by the LLM, which typically leads to over-trust in incorrectly generated answers. To tackle this limitation, we propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers. It thoroughly compares the trustworthiness of multiple candidate answers to mitigate over-trust in LLM-generated incorrect answers. Building upon this paradigm, we introduce a two-step framework that first instructs the LLM to reflect on and provide justifications for each candidate answer, and then aggregates the justifications for a comprehensive evaluation of the target answer. This framework can be seamlessly integrated with existing approaches for superior self-detection. Extensive experiments on six datasets spanning three tasks demonstrate the effectiveness of the proposed framework.
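
The two-step flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the function names, the prompt wording, and the convention that the model replies with a bare probability are all hypothetical choices made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical LLM interface (an assumption for this sketch):
# takes a prompt string and returns the model's text reply.
LLM = Callable[[str], str]


def reflect(llm: LLM, question: str, candidates: List[str]) -> Dict[str, str]:
    """Step 1: ask the LLM for a justification of each candidate answer."""
    justifications = {}
    for answer in candidates:
        # Prompt wording is illustrative only.
        prompt = (
            f"Question: {question}\n"
            f"Candidate answer: {answer}\n"
            "Explain why this answer could be correct."
        )
        justifications[answer] = llm(prompt)
    return justifications


def build_evaluation_prompt(
    question: str, target: str, justifications: Dict[str, str]
) -> str:
    """Step 2a: aggregate all justifications into one evaluation prompt."""
    lines = [f"Question: {question}"]
    for answer, why in justifications.items():
        lines.append(f"Possible answer: {answer}")
        lines.append(f"Justification: {why}")
    lines.append(f"Target answer: {target}")
    lines.append(
        "Considering all justifications, reply with only the probability "
        "(0 to 1) that the target answer is correct."
    )
    return "\n".join(lines)


def self_detect(
    llm: LLM, question: str, target: str, candidates: List[str]
) -> float:
    """Step 2b: score the target answer in light of every candidate."""
    justifications = reflect(llm, question, candidates)
    reply = llm(build_evaluation_prompt(question, target, justifications))
    # Assumes the reply is a bare number; raises ValueError otherwise.
    score = float(reply.strip())
    return min(1.0, max(0.0, score))  # clamp to [0, 1]
```

Calling `self_detect(llm, question, target, candidates)` with any function that maps a prompt to a reply yields a detection score for `target`. The final scoring step is where an existing self-detection approach would plug in; the single probability prompt used here is only a stand-in.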

Paper Structure (38 sections, 9 equations, 6 figures, 14 tables)

This paper contains 38 sections, 9 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: An illustration of the Think Twice before Trusting framework for mitigating the over-trust issue in LLM self-detection. The LLM is instructed to reflect on and generate a justification for the trustworthiness of each answer before evaluating the trustworthiness of the target answer.
  • Figure 2: Two existing paradigms of self-detection and our new comprehensive answer evaluation paradigm.
  • Figure 3: Comparison of self-detection methods on CAD. w/ cf denotes our strategy with counterfactual data. The AUROC values are shown on the x-axis. The boxes on the left and right represent the detection scores of incorrect and correct instances, respectively.
  • Figure 4: Visualization of the bias mitigation effect of $T^3$, which largely reduces the overlap in detection scores between correct (right) and incorrect (left) instances.
  • Figure 5: Accuracy improvement of selective prediction on $T^3$ detection scores.
  • ...and 1 more figure