Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method

Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Chong Meng, Shuaiqiang Wang, Zhicong Cheng, Zhaochun Ren, Dawei Yin

TL;DR

The paper tackles the problem of nonfactual responses in large language models by introducing a self-detection framework that does not rely on external data. It combines consistency-based detection over diversified verbalizations of a question with a verbalization-based atypicality measure, and validates the approach across multiple state-of-the-art models and tasks (factoid QA, commonsense reasoning, arithmetic). Key findings show that model responses diverge on semantically equivalent paraphrases of a question and that atypical verbalizations correlate with knowledge the model lacks. Both signals improve detection as measured by PR-AUC, and the method remains robust when combined with existing baselines. The work offers a practical, prompt-based method to monitor and improve LLM reliability, notes limitations regarding the quality of diversification and answers that are consistently wrong, and suggests future integration with verifiers or external knowledge sources.

Abstract

Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks. However, recent literature reveals that LLMs intermittently generate nonfactual responses, which undermines their reliability for further use. In this paper, we propose a novel self-detection method to identify the questions that an LLM does not know and for which it is prone to generate nonfactual results. Specifically, we first diversify the textual expressions of a given question and collect the corresponding answers. We then examine the divergences between the generated answers to identify the questions for which the model may generate falsehoods. All of the above steps can be accomplished by prompting the LLMs themselves, without referring to any external resources. We conduct comprehensive experiments and demonstrate the effectiveness of our method on recently released LLMs, e.g., Vicuna, ChatGPT, and GPT-4.
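The two steps described in the abstract (diversify the question, then examine how much the answers diverge) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The callables `paraphrase` and `ask` are hypothetical stand-ins for prompts sent to the LLM. Grouping answers by normalized exact match is a simplifying assumption made here; a real system would need a semantic-equivalence check. The two scores (majority agreement and entropy over answer groups) are one plausible way to quantify divergence, not necessarily the exact measures used in the paper.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List


def normalize(answer: str) -> str:
    """Crude canonical form (assumption: exact match after cleanup)."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def divergence_scores(answers: Iterable[str]) -> dict:
    """Score how much the answers to paraphrased questions disagree."""
    groups = Counter(normalize(a) for a in answers)
    total = sum(groups.values())
    if total == 0:
        raise ValueError("need at least one answer")
    probs = [count / total for count in groups.values()]
    # Share of answers that agree with the most common answer.
    consistency = max(probs)
    # Shannon entropy of the answer groups; 0 means full agreement.
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    return {"consistency": consistency, "entropy": entropy}


def self_detect(
    question: str,
    paraphrase: Callable[[str, int], List[str]],  # hypothetical LLM call
    ask: Callable[[str], str],                    # hypothetical LLM call
    n: int = 5,
) -> dict:
    """Paraphrase the question, collect answers, score their divergence."""
    variants = [question] + list(paraphrase(question, n))
    answers = [ask(v) for v in variants]
    return divergence_scores(answers)
```

Under these assumptions, low consistency or high entropy flags a question the model may not know. Answers that are consistently wrong would still score as consistent, which matches the limitation noted in the TL;DR.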

Paper Structure (37 sections, 4 equations, 4 figures, 14 tables, 1 algorithm)

Figures (4)

  • Figure 1: Two paradigms for detecting hallucinations. The dashed lines denote the LLM generation process. The solid lines denote non-factuality detection.
  • Figure 2: The framework of self-detecting what language models do not know.
  • Figure 3: The PR AUC when combining our method with the previously proposed TokenProbs (T), Perplexity (P), ConsistAnswers (C), and SelfCheckGPT (S).
  • Figure 4: Self-detection performance with different numbers of diversified questions.