Table of Contents
Fetching ...

Can AI Assistants Know What They Don't Know?

Qinyuan Cheng, Tianxiang Sun, Xiangyang Liu, Wenwei Zhang, Zhangyue Yin, Shimin Li, Linyang Li, Zhengfu He, Kai Chen, Xipeng Qiu

TL;DR

The paper tackles the problem of AI assistants hallucinating factual errors by enabling them to know when they don't know and to express that uncertainty. It introduces a model-specific Idk dataset derived from TriviaQA and evaluates prompting, supervised fine-tuning, and preference-based optimization (DPO, BoN, PPO, HIR) to train assistants to refuse unknowns while accurately answering known questions. Results show substantial gains in Truthful rate after Idk-alignment, with BoN often delivering the best in-distribution performance and HIR offering flexible control for out-of-distribution data. The work demonstrates a practical path to reducing hallucinations in open-domain QA by integrating explicit refusals, especially as model size grows and response strategies become tunable.

Abstract

Recently, AI assistants based on large language models (LLMs) show surprising performance in many tasks, such as dialogue, solving math problems, writing code, and using tools. Although LLMs possess intensive world knowledge, they still make factual errors when facing some knowledge intensive tasks, like open-domain question answering. These untruthful responses from the AI assistant may cause significant risks in practical applications. We believe that an AI assistant's refusal to answer questions it does not know is a crucial method for reducing hallucinations and making the assistant truthful. Therefore, in this paper, we ask the question "Can AI assistants know what they don't know and express them through natural language?" To answer this question, we construct a model-specific "I don't know" (Idk) dataset for an assistant, which contains its known and unknown questions, based on existing open-domain question answering datasets. Then we align the assistant with its corresponding Idk dataset and observe whether it can refuse to answer its unknown questions after alignment. Experimental results show that after alignment with Idk datasets, the assistant can refuse to answer most its unknown questions. For questions they attempt to answer, the accuracy is significantly higher than before the alignment.

Can AI Assistants Know What They Don't Know?

TL;DR

The paper tackles the problem of AI assistants hallucinating factual errors by enabling them to know when they don't know and to express that uncertainty. It introduces a model-specific Idk dataset derived from TriviaQA and evaluates prompting, supervised fine-tuning, and preference-based optimization (DPO, BoN, PPO, HIR) to train assistants to refuse unknowns while accurately answering known questions. Results show substantial gains in Truthful rate after Idk-alignment, with BoN often delivering the best in-distribution performance and HIR offering flexible control for out-of-distribution data. The work demonstrates a practical path to reducing hallucinations in open-domain QA by integrating explicit refusals, especially as model size grows and response strategies become tunable.

Abstract

Recently, AI assistants based on large language models (LLMs) show surprising performance in many tasks, such as dialogue, solving math problems, writing code, and using tools. Although LLMs possess intensive world knowledge, they still make factual errors when facing some knowledge intensive tasks, like open-domain question answering. These untruthful responses from the AI assistant may cause significant risks in practical applications. We believe that an AI assistant's refusal to answer questions it does not know is a crucial method for reducing hallucinations and making the assistant truthful. Therefore, in this paper, we ask the question "Can AI assistants know what they don't know and express them through natural language?" To answer this question, we construct a model-specific "I don't know" (Idk) dataset for an assistant, which contains its known and unknown questions, based on existing open-domain question answering datasets. Then we align the assistant with its corresponding Idk dataset and observe whether it can refuse to answer its unknown questions after alignment. Experimental results show that after alignment with Idk datasets, the assistant can refuse to answer most its unknown questions. For questions they attempt to answer, the accuracy is significantly higher than before the alignment.
Paper Structure (39 sections, 7 equations, 9 figures, 3 tables)

This paper contains 39 sections, 7 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Knowledge quadrants of an AI assistant. "Unknowns" represents what the AI does not actually know. "Knowns" represents what the AI actually knows. "Known" represents what the AI believes it knows. "Unknown" represents what the AI believes it does not know.
  • Figure 2: Knowledge quadrants of AI assistants on the Idk dataset (Ik threshold=1.0). Ik-Ik represents the AI answers the questions correctly. Idk-Ik represents the AI knows the answer but refuses to respond to the question. Idk-Idk represents the AI answers the question incorrectly. Ik-Idk represents the AI doesn't know the answer and refuses to respond to the question. w/Idk-Prompting: Using prompting can transform certain Idk-Idk questions to Ik-Idk questions. w/Idk-SFT: Idk-SFT allows the model to refuse to answer more questions it does not know, but it also tends to make the model more convervative, leading to incorrect refusals to answer some questions that it actually knows. w/Idk-DPO: Using preference-aware optimization, like DPO, can alleviate the model's excessive conservatism and reduce the number of Idk-Ik questions.
  • Figure 3: Top: Construction process of the Idk dataset. Bottom: Construction process of preference pairs. The green response indicates a correct answer, the red response indicates an incorrect answer, and "I don't know" represents the template for refusal to answer.
  • Figure 4: Left: Variation in the proportions of Ik and Idk questions within the Idk datasets constructed based on different Ik thresholds. Right: The changes in Ik-Ik rate, Ik-Idk rate, and Truthful rate after conducting Idk-SFT with different Idk datasets.
  • Figure 5: Label distribution in the Idk dataset across different models.
  • ...and 4 more figures