Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot

TL;DR

The paper tackles faithfulness in large language model (LLM) self-explanations: plausible free-text explanations may not reflect the model's actual reasoning. It introduces NeuroFaith, a framework that links the concepts in a natural language self-explanation (self-NLE) to mechanistic findings in the model's internal representations, using concept extraction, circuit-based interpretation, and a quantified faithfulness score F(x,e). The authors instantiate NeuroFaith for 2-hop reasoning and classification, showing that faithfulness correlates with accuracy and model size, and that a linear structure in representation space permits both detection of unfaithful explanations and steering-based enhancement of faithfulness. The work points to practical pathways toward more trustworthy AI through faithfulness measurement and real-time improvement via activation steering, while acknowledging biases in concept extraction and the need for broader applicability. These insights offer a principled route toward transparent reasoning in LLMs and pave the way for extensions to more complex chain-of-thought scenarios.

Abstract

Large Language Models (LLMs) can generate plausible free text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model's actual reasoning process, indicating a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free text self-explanation by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model's predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, a linear faithfulness probe based on NeuroFaith is developed to detect unfaithful self-explanations from representation space and improve faithfulness through steering. NeuroFaith provides a principled approach to evaluating and enhancing the faithfulness of LLM free text self-explanations, addressing critical needs for trustworthy AI systems.
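
To make the idea of a linear faithfulness probe and steering concrete, here is a minimal sketch on synthetic data. It is not the authors' implementation: the difference-of-means probe, the synthetic "activations", the dimensionality, and every function name (`fit_diff_of_means_probe`, `probe_score`, `steer`) are assumptions chosen for illustration. The abstract only states that a linear probe detects unfaithful self-explanations in representation space and that steering improves faithfulness.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_diff_of_means_probe(faithful, unfaithful):
    """Fit a simple linear probe (assumption: difference of class means).

    The direction points from the unfaithful mean to the faithful mean;
    the bias puts the decision boundary at the midpoint of the two means.
    """
    mu_f = mean_vector(faithful)
    mu_u = mean_vector(unfaithful)
    w = [a - b for a, b in zip(mu_f, mu_u)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_f, mu_u)]
    b = -dot(w, midpoint)
    return w, b

def probe_score(w, b, h):
    """Positive score: predicted faithful; negative: predicted unfaithful."""
    return dot(w, h) + b

def steer(h, w, alpha):
    """Shift activation h by alpha along the unit-normalised probe direction."""
    norm = dot(w, w) ** 0.5
    return [x + alpha * wi / norm for x, wi in zip(h, w)]

# Synthetic stand-ins for hidden activations (hypothetical, not real model data).
rng = random.Random(0)
dim = 8

def sample(center):
    return [rng.gauss(center, 1.0) for _ in range(dim)]

faithful = [sample(1.0) for _ in range(200)]
unfaithful = [sample(-1.0) for _ in range(200)]

w, b = fit_diff_of_means_probe(faithful, unfaithful)

# Probe accuracy on the (synthetic) training data.
correct = sum(probe_score(w, b, h) > 0 for h in faithful)
correct += sum(probe_score(w, b, h) <= 0 for h in unfaithful)
accuracy = correct / (len(faithful) + len(unfaithful))

# Steer one unfaithful activation toward the faithful side of the probe.
h = unfaithful[0]
h_steered = steer(h, w, alpha=10.0)
score_before = probe_score(w, b, h)
score_after = probe_score(w, b, h_steered)
```

Adding `alpha` times the unit direction raises the probe score by exactly `alpha * ||w||`, which is why steering along the probe direction moves an activation across the decision boundary once `alpha` is large enough. In a real setting the activations would come from a chosen layer of the model, and the probe and steering strength would need to be validated against the faithfulness measure itself rather than against the probe.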

Paper Structure

This paper contains 60 sections, 7 equations, 44 figures, 12 tables.

Figures (44)

  • Figure 1: NeuroFaith overview. NeuroFaith (1) extracts concepts from the self-NLE, (2) assesses the mechanistic influence of these concepts, and (3) measures faithfulness.
  • Figure 2: Majority vote faithfulness linear probe performance across models and datasets.
  • Figure 3: Linear vectors max. cosine similarity on gemma-2-27b
  • Figure 4: Faithfulness linear probe visualization examples before ($e$) and after ($e_{steer}$) faithfulness steering. Red: unfaithful activations; Green: faithful activations.
  • Figure 5: Concepts related to faithful self-NLE and the prediction "business", sorted by frequency for AGNews for gemma-2-2b.
  • ...and 39 more figures