Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability?

Yongjie Wang, Yibo Wang, Xin Zhou, Zhiqi Shen

TL;DR

This work addresses how dataset suitability for probe training relates to the generative uncertainty of LLMs. It couples a repeat-prompt paradigm to estimate response variability with segment-wise linear probes applied to deep latent representations, revealing a strong negative link between response uncertainty and probe accuracy. Through AttnLRP-based feature attribution, the authors show that higher uncertainty spreads relevance across more features, complicating linear probing, while low-uncertainty cases align with human knowledge in embeddings. The findings suggest a unified view where response uncertainty and probe effectiveness reflect shared internal representations, offering a lightweight diagnostic and guidance for interpretability efforts in LLMs.

Abstract

Probing techniques have shown promise in revealing how LLMs encode human-interpretable concepts, particularly when applied to curated datasets. However, the factors governing a dataset's suitability for effective probe training are not well understood. This study hypothesizes that probe performance on such datasets reflects characteristics of both the LLM's generated responses and its internal feature space. Through quantitative analysis of probe performance and LLM response uncertainty across a series of tasks, we find a strong correlation: improved probe performance consistently corresponds to a reduction in response uncertainty, and vice versa. We then examine this correlation through the lens of feature importance analysis. Our findings indicate that high LLM response variance is associated with a larger set of important features, which poses a greater challenge for probe models and often results in diminished performance. Moreover, leveraging the insights from response uncertainty analysis, we identify concrete examples where LLM representations align with human knowledge across diverse domains, offering additional evidence of interpretable reasoning in LLMs.
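To make the central measurement concrete, the following is a minimal sketch of how response uncertainty might be compared with probe performance. It is not the authors' pipeline. It assumes that each data segment has several numeric answers sampled by repeating the same prompt, that uncertainty is the variance of those answers, and that a probe score per segment is already available. The segment names, sampled answers, and probe scores are hypothetical toy values, and the helper functions are illustrative names written with the Python standard library only.

```python
from statistics import pvariance


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a (no tie correction): concordant minus discordant pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical toy data: (answers from repeated prompts, probe score).
segments = {
    "seg_a": ([1990, 1990, 1990, 1990], 0.92),
    "seg_b": ([1985, 1987, 1986, 1984], 0.81),
    "seg_c": ([1970, 1990, 1950, 2000], 0.55),
    "seg_d": ([1900, 1990, 1800, 2010], 0.40),
}

# Uncertainty estimator assumed here: population variance of the answers.
uncertainty = [pvariance(answers) for answers, _ in segments.values()]
probe_score = [score for _, score in segments.values()]

rho = spearman(uncertainty, probe_score)
tau = kendall_tau(uncertainty, probe_score)
print(f"Spearman: {rho:.3f}, Kendall: {tau:.3f}")
```

The toy values are constructed so that variance rises as the probe score falls, so both coefficients come out strongly negative. With real data the strength of the relationship is an empirical question.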

Paper Structure

This paper contains 22 sections, 3 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Framework for correlating probe performance with response uncertainty.
  • Figure 2: Trend between response uncertainty and probe performance on the Llama 3.1 (8B) model over all six datasets.
  • Figure 3: Correlation analysis at different generation temperatures for Llama 3.1 (8B). 'Var', '$K$', and 'Sp' denote variance (the uncertainty estimator) and the Kendall and Spearman rank correlation coefficients, respectively. Absolute values are shown for clearer visualization.
  • Figure 4: Correlation coefficient vs sliding window parameters, Llama 3.1 (8B) on Figures.
  • Figure 5: Llama 3.1 (8B) model. We gradually remove features that are unimportant to LLM responses and observe the drop in probe performance.
  • ...and 7 more figures