
Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models

Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, Masoud Hashemi

TL;DR

This paper introduces a black-box evaluation approach and a new dataset, Abstain-QA, crafted to rigorously assess Abstention Ability (AA). It also proposes a new confusion matrix, the "Answerable-Unanswerable Confusion Matrix" (AUCM), which serves as the basis for evaluating AA.

Abstract

Abstention Ability (AA) is a critical aspect of Large Language Model (LLM) reliability, referring to an LLM's capability to withhold responses when uncertain or lacking a definitive answer, without compromising performance. Although previous studies have attempted to improve AA, they lack a standardised evaluation method and remain unsuitable for black-box models where token prediction probabilities are inaccessible. This makes comparative analysis challenging, especially for state-of-the-art closed-source commercial LLMs. This paper bridges this gap by introducing a black-box evaluation approach and a new dataset, Abstain-QA, crafted to rigorously assess AA across varied question types (answerable and unanswerable), domains (well-represented and under-represented), and task types (fact-centric and reasoning). We also propose a new confusion matrix, the "Answerable-Unanswerable Confusion Matrix" (AUCM), which serves as the basis for evaluating AA by offering a structured and precise approach to assessment. Finally, we explore the impact of three prompting strategies, namely Strict Prompting, Verbal Confidence Thresholding, and Chain-of-Thought (CoT), on improving AA. Our results indicate that even powerful models such as GPT-4 and Mixtral 8x22b encounter difficulties with abstention; however, strategic approaches such as Strict Prompting and CoT can enhance this capability.

Paper Structure (21 sections, 2 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: With Abstain-QA, we assess the Abstention Ability ($\mathcal{AA}$) of models in different categories of 'Question Types', 'Domains' or 'Data Domains', and 'Task Types'. The selection of any combination from each of these categories aims to challenge the model across different types of information and cognitive demands.
  • Figure 2: We introduce the 'Answerable-Unanswerable Confusion Matrix (AUCM)' as a tailored approach to accurately quantify a model's abstention ability (see the evaluation methodology section). This matrix contrasts the type of model prediction (answered or abstained) with the question type (answerable or unanswerable) to capture all potential outcomes.
  • Figure 3: (a) and (b) depict an Answerable and an Unanswerable sample, respectively, from the Carnatic-QA dataset, which consists of samples from an under-represented domain, Carnatic Music. The bold option in each figure represents the correct answer.
  • Figure 4: A demonstration of the impact of introducing Abstain and Extreme Abstain clauses (see the appendix with examples of abstain clause variations) on the final answer of GPT-4 32k. The example is from Pop QA, in the Verbal Confidence setup. With the standard clause, GPT-4 32k gives (D) as the predicted answer, which is incorrect. With both the Abstain and Extreme Abstain clauses, however, the model changes its answer to the correct option (E).
  • Figure 5: (a) Abstain Clause - An illustration of the $\mathcal{AC}$ utilised in all three experiments. (b) Extreme Abstain Clause - The top figure illustrates the $\mathcal{EAC}$ used in the Base and Verbal Confidence experiments, while the bottom figure presents an alternate version used in the Chain of Thought experiment.
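
The Figure 2 caption describes the AUCM as crossing the model's action (answered or abstained) with the question type (answerable or unanswerable). A minimal sketch of that tally is given below. The names `Outcome`, `tally_outcomes`, and `abstention_rate_on_unanswerable`, the toy data, and the rate itself are illustrative assumptions; they are not the paper's implementation or its metric definitions, and the paper's matrix may use more cells than the four shown here.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Outcome(NamedTuple):
    """One evaluated question (hypothetical record layout)."""
    answerable: bool  # ground truth: does the question have a correct answer?
    abstained: bool   # model behaviour: did the model decline to answer?


def tally_outcomes(outcomes: Iterable[Outcome]) -> Counter:
    """Count each (question type, model action) pair."""
    counts: Counter = Counter()
    for outcome in outcomes:
        row = "answerable" if outcome.answerable else "unanswerable"
        col = "abstained" if outcome.abstained else "answered"
        counts[(row, col)] += 1
    return counts


def abstention_rate_on_unanswerable(counts: Counter) -> float:
    """Share of unanswerable questions on which the model abstained.

    Illustrative only (an assumption): the paper defines its own metrics
    on top of the AUCM, which are not reproduced here.
    """
    abstained = counts[("unanswerable", "abstained")]
    total = abstained + counts[("unanswerable", "answered")]
    return abstained / total if total else 0.0


# Toy data, invented purely for illustration.
toy = [
    Outcome(answerable=True, abstained=False),
    Outcome(answerable=True, abstained=True),
    Outcome(answerable=False, abstained=True),
    Outcome(answerable=False, abstained=False),
    Outcome(answerable=False, abstained=True),
]
counts = tally_outcomes(toy)
rate = abstention_rate_on_unanswerable(counts)
```

Because the tally needs only the model's visible output (answered or abstained), it fits the black-box setting described in the abstract, where token probabilities are unavailable.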