
GRADE: Probing Knowledge Gaps in LLMs through Gradient Subspace Dynamics

Yujing Wang, Yuanbang Liang, Yukun Lai, Hainan Zhang, Hanqi Yan

Abstract

Detecting whether a model's internal knowledge is sufficient to correctly answer a given question is a fundamental challenge in deploying responsible LLMs. Beyond verbalising confidence through LLM self-report, more recent methods probe the model internals, such as the hidden states of the response tokens, to capture how much knowledge is activated. We argue that such activated knowledge may not align with what the query requires; for example, it may capture stylistic and length-related features that are uninformative for answering the query. To fill this gap, we propose GRADE (Gradient Dynamics for knowledge gap detection), which quantifies the knowledge gap via the cross-layer ratio of the rank of the gradient subspace to that of the corresponding hidden-state subspace. This is motivated by the property of gradients as estimators of the knowledge updates required for a given target. We validate GRADE on six benchmarks, demonstrating its effectiveness and robustness to input perturbations. In addition, we present a case study showing how the gradient chain can generate interpretable explanations of knowledge gaps for long-form answers.

Paper Structure

This paper contains 33 sections, 16 equations, 12 figures, and 1 table.

Figures (12)

  • Figure 1: t-SNE visualizations of hidden states for answerable and unanswerable queries.
  • Figure 2: GRADE for knowledge gap probing. Given an input $q$, (i) Forward pass: compute hidden states $h,o$ in the MLP block and loss $\mathcal{L}$, either before response generation or after; Backward pass: derive gradient $g$; (ii) Rank ratio calculation: project gradients onto the subspace spanned by $h$ and compute rank ratios; (iii) Probe training: aggregate rank ratios across $L$ layers to predict the knowledge gap.
  • Figure 3: Pearson correlation between the input sequence length and the values of different rank-based metrics. The internals are extracted from the Llama-3-8B-Instruct model with HotpotQA as input. The rank ratio is the most robust metric to input length variation.
  • Figure 4: Relative change in detection accuracy ($\Delta Acc$) before and after input paraphrase. Smaller changes imply that the method is more robust to the perturbation.
  • Figure 5: Cross-dataset generalization accuracy heatmaps. The results within the red box are transferred between datasets of similar question complexity (single-hop), i.e., between TQA and NQ.
  • ...and 7 more figures
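The paper's exact operationalisation of the rank ratio is not reproduced on this page. As a rough illustration of steps (i)-(ii) in the Figure 2 caption, the following minimal NumPy sketch assumes per-layer hidden-state and gradient matrices `H` and `G` of shape `(tokens, hidden_dim)`, measures numerical rank by thresholded singular values, and projects gradients onto the subspace spanned by the hidden states before taking the ratio; the tolerance and projection details are assumptions, not the authors' implementation.

```python
import numpy as np

def effective_rank(M, tol=1e-6):
    """Numerical rank: count singular values above a relative threshold."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def project_onto_rowspace(G, H, tol=1e-6):
    """Project gradient rows G onto the subspace spanned by hidden states H."""
    _, s, Vt = np.linalg.svd(H, full_matrices=False)
    basis = Vt[s > tol * s[0]]          # orthonormal basis of H's row space
    return G @ basis.T @ basis

def rank_ratio(G, H, tol=1e-6):
    """Rank of projected gradients relative to rank of the hidden states."""
    G_proj = project_onto_rowspace(G, H, tol)
    return effective_rank(G_proj, tol) / effective_rank(H, tol)
```

Per step (iii), one such ratio per layer would then be collected into an `L`-dimensional feature vector and fed to a trained probe that predicts the knowledge gap.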