Language Models Fail to Introspect About Their Knowledge of Language

Siyuan Song, Jennifer Hu, Kyle Mahowald

TL;DR

This paper investigates whether large language models possess introspective access to their own linguistic knowledge. Using a controlled setup with 21 open-source LLMs, it compares direct string-probability judgments to metalinguistic prompts across grammar and word prediction tasks, while rigorously accounting for model similarity. The authors introduce a formal measure of introspection based on cross-method alignment beyond similarity, and they find no evidence of privileged self-access: metalinguistic responses correlate with internal knowledge, but not more so for a model with itself than for similar models. The findings suggest that prompting can reveal information about linguistic knowledge, but should not be conflated with direct internal generalizations, carrying implications for linguistics research and the assessment of model capabilities.

Abstract

There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking "Is this sentence grammatical?"). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model's internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models' responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model's prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged "self-access". By using general tasks, controlling for model similarity, and evaluating a wide range of open-source models, we show that LLMs cannot introspect, and add new evidence to the argument that prompted responses should not be conflated with models' linguistic generalizations.
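The proposed measure can be illustrated with a minimal sketch. It assumes each model already has one "direct" score per item (for example, a log-probability difference between the two sentences of a minimal pair) and one "metalinguistic" score per item derived from its prompted responses. The names `pearson_r`, `alignment_matrix`, and `same_model_gap`, and the toy numbers, are hypothetical and not taken from the paper. The sketch only contrasts within-model alignment with the average cross-model alignment; the paper additionally controls for model similarity, which this simplification does not do.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def alignment_matrix(delta_meta, delta_direct):
    """Correlate model A's metalinguistic scores with model B's direct scores.

    Both arguments map a model name to its per-item scores (same item order).
    Returns a dict keyed by (A, B).
    """
    return {
        (a, b): pearson_r(delta_meta[a], delta_direct[b])
        for a in delta_meta
        for b in delta_direct
    }


def same_model_gap(align, model):
    """Within-model alignment minus the mean alignment with other models.

    A clearly positive gap would be the pattern expected under introspection;
    this simplified version ignores how similar the other models are.
    """
    within = align[(model, model)]
    across = [r for (a, b), r in align.items() if a == model and b != model]
    if not across:
        raise ValueError("need at least one other model for comparison")
    return within - sum(across) / len(across)


# Toy, made-up scores for two hypothetical models over four items.
toy_direct = {"A": [1.0, 2.0, 3.0, 4.0], "B": [1.0, 2.0, 3.0, 5.0]}
toy_meta = {"A": [2.0, 4.0, 6.0, 8.0], "B": [0.5, 1.5, 2.0, 4.0]}
toy_align = alignment_matrix(toy_meta, toy_direct)
toy_gap_a = same_model_gap(toy_align, "A")
```

In this toy data the gap for model A is small but positive by construction. The paper reports no such advantage once model similarity is taken into account.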

Paper Structure

This paper contains 41 sections, 1 equation, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Overview of our approach. (a,c) Example "direct" and "metalinguistic" evaluation in (a) Exp. 1 (grammaticality) and (c) Exp. 2 (word prediction). (b) We analyze the alignment between scores derived from direct and metalinguistic evaluation, both within and across models. (d) Potential patterns of alignment across different types of model pairs.
  • Figure 2: Validation of methods in Exp. 1. (a) Models achieve high accuracy under both Direct and Meta methods. Vertical lines separate models into bins of similar parameter counts. (b) $\Delta\text{Meta}_A \sim \Delta\text{Direct}_B$ Pearson $r$ (averaged across prompts) for each pair of models, excluding items where $>95\%$ of models gave the same answer.
  • Figure 3: No evidence for introspection in Exp. 1. $\Delta\text{Meta}_A \sim \Delta\text{Direct}_B$ average Pearson $r$ for each pair of models, versus two measures of model similarity: (a) manually designed ModelSimilarity features, and (b) empirically measured $\Delta\text{Direct}_A \sim \Delta\text{Direct}_B$ scores. Similarity generally predicts $\Delta\text{Meta}_A \sim \Delta\text{Direct}_B$, but we find no evidence for a "same model effect" consistent with introspection.
  • Figure 4: No evidence for introspection in Exp. 2. (a) $\Delta\text{Meta}_A \sim \Delta\text{Direct}_B$ correlation is not higher within than across models, for ModelSimilarity features. (b) No "same model effect" when predicting $\Delta\text{Meta}_A \sim \Delta\text{Direct}_B$ from $\Delta\text{Direct}_A \sim \Delta\text{Direct}_B$.
  • Figure 5: Consistency (measured by Cohen's $\kappa$) between metalinguistic judgments and probability measurements. (a) Exp. 1. (b) Exp. 2. Each line stands for a prompt in the experiment.
  • ...and 5 more figures
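
The consistency measure named in the Figure 5 caption, Cohen's $\kappa$, can be sketched as follows. This is a generic implementation of the standard formula, not the authors' code; the function name `cohens_kappa` and the example labels are hypothetical. It assumes each method has been reduced to one categorical judgment per item, for example 1 if the method prefers the grammatical sentence and 0 otherwise.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: fraction of items where both methods agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two methods labelled items independently.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1.0 - expected)


# Made-up binary judgments from a prompted ("meta") and a probability-based
# ("direct") evaluation of the same four items.
meta_judgments = [1, 1, 1, 0]
direct_judgments = [1, 1, 0, 0]
kappa_example = cohens_kappa(meta_judgments, direct_judgments)
```

A value near 1 indicates that the two methods agree far more often than chance would predict, and a value near 0 indicates agreement at chance level.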