To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity

Anastasiia Sedova, Robert Litschko, Diego Frassinelli, Benjamin Roth, Barbara Plank

TL;DR

Analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities reveals that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts.

Abstract

One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs.

Paper Structure (24 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Overview of our four studies on LLMs' self-consistency using prompts with ambiguous entities. Colors indicate preferred (green) and alternative (red) readings implied in the query or adopted by the model.
  • Figure 2: Models' preferred readings discovered in Study 2 (blue for non-company, yellow for company; e.g., all analyzed models prefer the 'company' reading for entities from the people category).
  • Figure 3: Results of Study 4 (% of entities). "Consistent" entities are those for which the model reaffirmed all information it provided in Study 3. "Partially Consistent" entities are those for which some, but not all, of that information was reaffirmed. "Inconsistent" entities are those for which all previously provided information was denied. The exact numbers are provided in Appendix \ref{app:studies} (Table \ref{app:tab:study_4}).
  • Figure 4: Popularity distribution of company and non-company readings of all 49 entities involved in our studies.
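
The three categories in the Figure 3 caption can be read as a simple rule over per-entity verification outcomes. The following is a minimal sketch of that rule, not the authors' implementation. It assumes each entity comes with a list of booleans, one per piece of information the model provided in Study 3, where `True` means the model reaffirmed it in Study 4. The function names and the placeholder entity keys are hypothetical.

```python
from collections import Counter

LABELS = ("Consistent", "Partially Consistent", "Inconsistent")


def classify_entity(reaffirmed):
    """Label one entity from its verification outcomes.

    `reaffirmed` holds one boolean per piece of information the model
    gave earlier: True if it was reaffirmed, False if it was denied.
    (This input format is an assumption made for illustration.)
    """
    if not reaffirmed:
        raise ValueError("an entity needs at least one verification outcome")
    if all(reaffirmed):
        return "Consistent"
    if not any(reaffirmed):
        return "Inconsistent"
    return "Partially Consistent"


def consistency_shares(outcomes):
    """Return the percentage of entities falling under each label."""
    counts = Counter(classify_entity(v) for v in outcomes.values())
    total = len(outcomes)
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Placeholder data, not results from the paper.
example = {
    "entity_a": [True, True, True],
    "entity_b": [True, False],
    "entity_c": [False, False],
    "entity_d": [True],
}
shares = consistency_shares(example)
```

With the placeholder data above, two of the four entities are fully reaffirmed, one is mixed, and one is fully denied, so the shares split the same way a bar in Figure 3 would.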