
Competence-Based Analysis of Language Models

Adam Davies, Jize Jiang, ChengXiang Zhai

TL;DR

Competence-Based Analysis of Language Models (CALM) introduces a causality-inspired framework for quantifying how linguistically interpretable representations underpin LLM behavior. By formalizing tasks with structural causal models and applying gradient-based interventions (GBIs) to internal representations, CALM measures a model's linguistic competence as its alignment with a ground-truth causal structure under interventions. The paper develops an interchange-intervention-style competence metric, $C_{\mathcal{T}}(M \mid G_{\mathcal{T}})$, and applies GBIs to BERT and RoBERTa across 14 lexical-inference tasks from the LAMA ConceptNet suite, showing that competence, not just accuracy, helps explain model brittleness and generalization under distribution shift. These results suggest a principled way to diagnose robustness, guide model design, and anticipate behavior under prompt variation.
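To make "alignment with ground-truth causal structure under interventions" concrete, the sketch below scores a model by how often its prediction after an intervention $\mathrm{do}(Z = z')$ matches the output that the ground-truth causal model $G_{\mathcal{T}}$ prescribes for the same intervention. This agreement rate is a simplified, hypothetical stand-in and not the paper's definition of $C_{\mathcal{T}}(M \mid G_{\mathcal{T}})$. The function names (`competence_proxy`, `scm`, `toy_model`) and the three-word hypernym table are made up for illustration.

```python
from typing import Callable, Hashable, Sequence, Tuple

# Hypothetical signature: f(x, z_prime) returns the output for input x after do(Z = z_prime).
IntervenedFn = Callable[[Hashable, Hashable], Hashable]


def competence_proxy(
    model_under_do: IntervenedFn,
    scm_under_do: IntervenedFn,
    interventions: Sequence[Tuple[Hashable, Hashable]],
) -> float:
    """Share of (x, z') pairs where M(x | do(Z = z')) matches the ground-truth SCM.

    Illustrative agreement-rate proxy only; not the paper's C_T(M | G_T)."""
    if not interventions:
        raise ValueError("need at least one (x, z') pair")
    agree = sum(
        model_under_do(x, z_prime) == scm_under_do(x, z_prime)
        for x, z_prime in interventions
    )
    return agree / len(interventions)


# Toy hypernym task (made-up data): the concept Z determines the hypernym.
HYPERNYM = {"dog": "animal", "rose": "flower", "oak": "tree"}


def scm(x, z_prime):
    # Ground truth: after do(Z = z'), the answer is the hypernym of z'.
    return HYPERNYM[z_prime]


def toy_model(x, z_prime):
    # Toy "model" that follows the intervention except when z' is "oak",
    # where it falls back to the hypernym of the original concept x.
    return HYPERNYM[x] if z_prime == "oak" else HYPERNYM[z_prime]


pairs = [("dog", "rose"), ("rose", "oak"), ("oak", "dog"), ("dog", "oak")]
print(competence_proxy(toy_model, scm, pairs))  # 2 of 4 pairs agree -> 0.5
```

The toy model scores perfectly on un-intervened inputs yet only 0.5 under interventions, which illustrates why the framework distinguishes competence from accuracy.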

Abstract

Despite the recent successes of large, pretrained neural language models (LLMs), comparatively little is known about the representations of linguistic structure they learn during pretraining, which can lead to unexpected behaviors in response to prompt variation or distribution shift. To better understand these models and behaviors, we introduce a general model analysis framework to study LLMs with respect to their representation and use of human-interpretable linguistic properties. Our framework, CALM (Competence-based Analysis of Language Models), is designed to investigate LLM competence in the context of specific tasks by intervening on models' internal representations of different linguistic properties using causal probing, and measuring models' alignment under these interventions with a given ground-truth causal model of the task. We also develop a new approach for performing causal probing interventions using gradient-based adversarial attacks, which can target a broader range of properties and representations than prior techniques. Finally, we carry out a case study of CALM using these interventions to analyze and compare LLM competence across a variety of lexical inference tasks, showing that CALM can be used to explain behaviors across these tasks.
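The idea of using gradient-based adversarial attacks as causal probing interventions can be sketched with a toy probe. The sketch assumes a binary property $Z$ and a logistic probe $g_Z(\mathbf{h}) = \sigma(\mathbf{w}^\top \mathbf{h} + b)$; both are assumptions, and the paper's probes and property spaces may differ. A targeted FGSM step moves the embedding against the gradient of the probe's loss for the counterfactual label $z'$. In the full method the modified embedding would then be fed back into the model's later layers; here we only check that the probe's reading flips. The weights `w` and embedding `h` are hypothetical stand-ins for a trained probe and a real hidden state.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def fgsm_intervene(h, w, b, z_target, eps):
    """One targeted FGSM step against a logistic probe g_Z(h) = sigmoid(w @ h + b).

    Moves every coordinate of h by eps in the direction that lowers the
    binary cross-entropy of the probe for the counterfactual label z_target."""
    p = sigmoid(w @ h + b)
    grad = (p - z_target) * w  # gradient of BCE w.r.t. h for a logistic probe
    return h - eps * np.sign(grad)  # targeted attack: descend the target loss


rng = np.random.default_rng(0)
w = rng.normal(size=16)      # hypothetical probe weights (stand-in for a trained g_Z)
b = 0.0
h = 0.05 * np.sign(w)        # toy embedding that the probe reads as Z = 1
h_cf = fgsm_intervene(h, w, b, z_target=0, eps=0.1)  # try to encode Z = 0 instead

print(sigmoid(w @ h + b) > 0.5, sigmoid(w @ h_cf + b) < 0.5)
```

Because FGSM takes a single signed step, every coordinate moves by exactly $\epsilon$, so the perturbation sits on the boundary of the $L_\infty$ budget.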

Paper Structure

This paper contains 45 sections, 6 equations, and 5 figures.

Figures (5)

  • Figure 1: Structural causal model (SCM) of task $\mathcal{T}$'s data-generating process and how it may be performed by model $M$. Shaded and white nodes denote observed and unobserved variables, respectively. In CALM, the goal is to determine which representations $Z_j = z_j$ are causally implicated in $M$'s predictions $\hat{\mathbf{y}}$.
  • Figure 2: SCM of a competent English speaker on the hypernym prediction task. Shaded and white nodes denote observed and unobserved variables, respectively.
  • Figure 3: Gradient-Based Interventions. Input tokens $\mathbf{x} = (x_1, \ldots, x_{|\mathbf{x}|})$ are passed through layers $L = 1, \ldots, l$, where embedding $\mathbf{h}_i^l$ (encoding the value $Z = z$) is extracted from layer $l$ and given to $g_Z$ as input. Next, the embedding is modified by gradient-based attacks on $g_Z$ to encode the counterfactual value $Z = z'$, then fed back into subsequent layers $L = l+1, \ldots, |L|$ and the language modeling head $f_{\text{LM}}$ to obtain the intervened predictions $M(\mathbf{x} \mid \mathrm{do}(Z = z'))$.
  • Figure 4: Performance (left) and competence (right) of BERT (left bars) and RoBERTa (right bars) for all tasks, using FGSM with $\epsilon = 0.1$. In the competence plot, y-values are the average competence score, and error bars show the maximum and minimum competence scores measured over 10 experimental iterations (each with a different randomly initialized probe $g_Z$).
  • Figure 5: Competence of BERT (left bars) and RoBERTa (right bars) for all tasks, using PGD with $\epsilon = 0.1$. Y-values are the average competence score, and error bars show the maximum and minimum competence scores measured over 10 experimental iterations (each with a different randomly initialized probe $g_Z$).
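Figures 4 and 5 compare FGSM with PGD at the same budget $\epsilon = 0.1$. The difference can be sketched with the same toy setup: PGD repeats smaller signed-gradient steps and projects back into the $\epsilon$-ball around the original embedding after each one. The logistic probe, the step size `alpha`, the number of `steps`, and the helper name `pgd_intervene` are all assumptions for illustration, not the paper's settings.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def pgd_intervene(h, w, b, z_target, eps=0.1, alpha=0.02, steps=20):
    """Targeted L-infinity PGD against a logistic probe g_Z(h) = sigmoid(w @ h + b).

    Takes repeated signed-gradient steps of size alpha toward the counterfactual
    label z_target, projecting back into the eps-ball around the original h."""
    h_adv = h.copy()
    for _ in range(steps):
        p = sigmoid(w @ h_adv + b)
        grad = (p - z_target) * w                 # BCE gradient w.r.t. the embedding
        h_adv = h_adv - alpha * np.sign(grad)     # targeted step on the probe loss
        h_adv = np.clip(h_adv, h - eps, h + eps)  # project onto ||h_adv - h||_inf <= eps
    return h_adv


rng = np.random.default_rng(1)
w = rng.normal(size=16)      # hypothetical probe weights
b = 0.0
h = 0.05 * np.sign(w)        # toy embedding read by the probe as Z = 1
h_pgd = pgd_intervene(h, w, b, z_target=0)

print(sigmoid(w @ h_pgd + b) < 0.5, np.max(np.abs(h_pgd - h)))
```

For a linear probe, PGD and FGSM end up at the same corner of the $\epsilon$-ball; the two attacks differ in practice when the probe (or the loss surface around $\mathbf{h}_i^l$) is nonlinear, where PGD's iterative steps can find a stronger perturbation within the same budget.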