A Survey on the Honesty of Large Language Models

Siheng Li, Cheng Yang, Taiqiang Wu, Chufan Shi, Yuji Zhang, Xinyu Zhu, Zesen Cheng, Deng Cai, Mo Yu, Lemao Liu, Jie Zhou, Yujiu Yang, Ngai Wong, Xixin Wu, Wai Lam

TL;DR

This survey addresses the problem of honesty in large language models by clarifying definitions and proposing a structured evaluation framework around self-knowledge and self-expression. It reviews model-agnostic and model-specific benchmarks, and identifies metrics for recognition of known/unknown, calibration, and selective prediction, alongside evaluation of self-expression through identification-based and identification-free approaches. The paper gives a unified discussion of a wide range of improvement strategies, spanning training-free and training-based methods for both self-knowledge and self-expression, including prompting, prompting-based elicitation, decoding-time interventions, RLHF, and probing. It also outlines future directions such as objective versus subjective honesty, knowledge identification, honesty in instruction-following, in-context knowledge, and extending studies to diverse model families.

Abstract

Honesty is a fundamental principle for aligning large language models (LLMs) with human values, requiring these models to recognize what they know and don't know and to faithfully express their knowledge. Despite their promise, current LLMs still exhibit significant dishonest behaviors, such as confidently presenting wrong answers or failing to express what they know. In addition, research on the honesty of LLMs faces challenges of its own, including varying definitions of honesty, difficulties in distinguishing between known and unknown knowledge, and a lack of comprehensive understanding of related research. To address these issues, we provide a survey on the honesty of LLMs, covering its clarification, evaluation approaches, and strategies for improvement. Moreover, we offer insights for future research, aiming to inspire further exploration in this important area.

Paper Structure

This paper contains 41 sections, 10 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: The outline of this survey.
  • Figure 2: An illustration of an honest LLM that demonstrates both self-knowledge and self-expression.
  • Figure 3: Illustrations of self-knowledge evaluation, encompassing the recognition of known/unknown, calibration, and selective prediction. "Conf" indicates the LLM's confidence score and "Acc" represents the accuracy of the response.
  • Figure 4: Illustrations of self-expression evaluation, encompassing both identification-based and identification-free approaches.
  • Figure 5: Improvement of self-knowledge, encompassing both training-based and training-free approaches.
  • ...and 1 more figure