Interpreting Multi-Attribute Confounding through Numerical Attributes in Large Language Models

Hirohane Takagi, Gouki Minegishi, Shota Kizawa, Issey Sukeda, Hitomi Yanaka

TL;DR

Using Partial Least Squares (PLS) regression on hidden states $X \in \mathbb{R}^{n\times h}$ to derive a low-dimensional subspace $Z = XW \in \mathbb{R}^{n\times k}$ and predictions $\hat{Y}$, this paper probes how Large Language Models encode multiple numerical attributes and, via Spearman partial correlations, how they respond to irrelevant numerical context. Across four transformer LLMs, the authors show that LLMs preserve real-world numerical correlations but tend to amplify them, with inter-attribute subspaces overlapping and exhibiting asymmetric interference. They also demonstrate that irrelevant numerical context in prompts shifts internal representations and that these perturbations propagate differently by model size, with smaller models being more susceptible to prompt-induced bias. The work highlights a vulnerability in numerically sensitive decision making and provides a representation-aware framework for designing fairer prompts and mitigation strategies in numerically entangled contexts, guiding future work on the robustness and interpretability of LLMs in high-stakes numerical tasks.
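The PLS probe summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single-target PLS (NIPALS-style PLS1) written with NumPy, run on synthetic data that stands in for hidden states. The names `make_synthetic`, `pls1_fit`, `pls1_transform`, and `pls1_predict` are hypothetical, and the rotation matrix `R` plays the role of $W$ in $Z = XW$ after mean-centering $X$.

```python
import numpy as np


def make_synthetic(n=200, h=16, seed=0):
    """Hypothetical stand-in for hidden states X (n x h) and one numerical attribute y."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, h))
    direction = rng.normal(size=h)  # assumed linear "attribute direction"
    y = X @ direction + 0.1 * rng.normal(size=n)
    return X, y


def pls1_fit(X, y, k):
    """Fit a k-component PLS1 probe; returns the centering statistics and weights."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    Xk = Xc.copy()
    W, P, q = [], [], []
    for _ in range(k):
        w = Xk.T @ yc                 # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # scores along that direction
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        q.append(yc @ t / tt)         # regression coefficient of y on the scores
        Xk = Xk - np.outer(t, p)      # deflate X before the next component
        W.append(w)
        P.append(p)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    R = W @ np.linalg.inv(P.T @ W)    # maps centered X directly to scores: Z = Xc @ R
    return {"x_mean": x_mean, "y_mean": y_mean, "R": R, "q": q}


def pls1_transform(model, X):
    """Project hidden states into the k-dimensional PLS subspace Z."""
    return (X - model["x_mean"]) @ model["R"]


def pls1_predict(model, X):
    """Predict the attribute from the PLS subspace."""
    return pls1_transform(model, X) @ model["q"] + model["y_mean"]


X, y = make_synthetic()
model = pls1_fit(X, y, k=3)
Z = pls1_transform(model, X)          # shape (n, k)
y_hat = pls1_predict(model, X)        # shape (n,)
```

In practice a library implementation such as scikit-learn's `PLSRegression` would likely be used instead, and predictions would be evaluated on held-out entities rather than on the training data as in this sketch.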

Abstract

Although behavioral studies have documented numerical reasoning errors in large language models (LLMs), the underlying representational mechanisms remain unclear. We hypothesize that numerical attributes occupy shared latent subspaces and investigate two questions: (1) How do LLMs internally integrate multiple numerical attributes of a single entity? (2) How does irrelevant numerical context perturb these representations and their downstream outputs? To address these questions, we combine linear probing with partial correlation analysis and prompt-based vulnerability tests across models of varying sizes. Our results show that LLMs encode real-world numerical correlations but tend to systematically amplify them. Moreover, irrelevant context induces consistent shifts in magnitude representations, with downstream effects that vary by model size. These findings reveal a vulnerability in LLM decision-making and lay the groundwork for fairer, representation-aware control under multi-attribute entanglement.

Paper Structure

This paper contains 43 sections, 1 equation, 11 figures, 1 table.

Figures (11)

  • Figure 1: Overview of our approach to analyzing internal representations in LLMs by addressing two research questions (RQs): RQ1 examines how LLMs represent entities with multiple correlated numerical properties (e.g., San Diego’s population and area). RQ2 investigates how irrelevant numerical details in prompts influence these internal representations.
  • Figure 2: Correlation matrices for human (top) and geographical (bottom) entities. The year attributes of human entities and some attributes of geographical entities tend to be correlated (significance: *$p<0.05$, **$p<0.01$, ***$p<0.001$).
  • Figure 3: Spearman correlations for Llama 3.1 8B (top) and Qwen2.5-32B (bottom): diagonal entries are within-attribute, while off-diagonal entries are inter-attribute. For human entities, one attribute can be predicted from another. For geographical entities, the diagonal entries are below one, so within-attribute reproduction is incomplete, yet area, population, and latitude still exhibit unnatural cross-attribute correlations.
  • Figure 4: Layer-wise apparent Spearman correlation $r_s(\hat{Y}_t,Y_t)$ (blue), attribute fidelity $r_s(\hat{Y}_s,Y_s \mid Y_t)$ (orange), and attribute contamination $r_s(\hat{Y}_t, Y_t \mid Y_s)$ (green) for Llama 3.1 8B and Qwen2.5-32B, shown for the (birthyear, workperiodstart) and (area, population) pairs. In each column, the upper panel shows high source-attribute fidelity and low contamination by the target attribute, while the lower panel shows the opposite: relatively high contamination and low fidelity. Each tick on the horizontal axis corresponds to a Transformer layer from which both the source and target attribute representations were extracted.
  • Figure 5: Correlations between the model outputs and the reference means in prompts. All models show higher correlations as the number of few-shot examples increases, with smaller models being more susceptible.
  • ...and 6 more figures
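
The Figure 4 caption reports quantities of the form $r_s(A, B \mid C)$, that is, Spearman partial correlations. As a rough illustration (not the authors' code), a first-order partial correlation can be computed by rank-transforming the three variables and applying the standard formula $r_{AB\mid C} = (r_{AB} - r_{AC}\,r_{BC}) / \sqrt{(1-r_{AC}^2)(1-r_{BC}^2)}$. All function names below are hypothetical, and `make_confounded_sample` generates synthetic data in which two variables are related only through a shared third variable.

```python
import random
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def partial_spearman(a, b, c):
    """Spearman correlation between a and b, controlling for c."""
    r_ab, r_ac, r_bc = spearman(a, b), spearman(a, c), spearman(b, c)
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def make_confounded_sample(n=300, seed=0):
    """Synthetic data: a and b each depend on c plus independent noise."""
    rng = random.Random(seed)
    c = [float(i) for i in range(n)]
    a = [ci + rng.gauss(0, 0.2 * n) for ci in c]
    b = [ci + rng.gauss(0, 0.2 * n) for ci in c]
    return a, b, c


a, b, c = make_confounded_sample()
raw = spearman(a, b)                  # inflated by the shared dependence on c
controlled = partial_spearman(a, b, c)  # close to zero once c is controlled for
```

Under this sketch, a fidelity-style quantity such as $r_s(\hat{Y}_s, Y_s \mid Y_t)$ would correspond to `partial_spearman(y_hat_s, y_s, y_t)`, where the three argument names are placeholders for probe predictions and ground-truth attribute values.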