Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models

Yuheng Wu, Wentao Guo, Zirui Liu, Heng Ji, Zhaozhuo Xu, Denghui Zhang

TL;DR

The paper investigates how Theory-of-Mind (ToM) capabilities emerge in large language models by identifying extremely sparse, ToM-sensitive parameter patterns. It introduces a Hessian/Fisher-based framework to isolate a tiny subset of parameters that, when perturbed, substantially impair ToM while largely preserving perplexity, with the most pronounced effects in RoPE-based models. The authors link these ToM-sensitive patterns to positional encoding by showing that perturbations disrupt dominant-frequency activations and shift attention sinks, thereby degrading contextual localization and language understanding. The findings reveal architecture-specific mechanisms—RoPE-based models rely on frequency-structured activations and BOS-related attention dynamics—offering implications for model alignment, bias mitigation, and designing interactions that rely on social reasoning. Collectively, the work bridges interpretability, cognitive science, and robust AI design by mapping a sparse parameter geometry to ToM-related behaviors and attention dynamics.

Abstract

This paper investigates the emergence of Theory-of-Mind (ToM) capabilities in large language models (LLMs) from a mechanistic perspective, focusing on the role of extremely sparse parameter patterns. We introduce a novel method to identify ToM-sensitive parameters and reveal that perturbing as little as 0.001% of these parameters significantly degrades ToM performance while also impairing contextual localization and language understanding. To understand this effect, we analyze their interaction with core architectural components of LLMs. Our findings demonstrate that these sensitive parameters are closely linked to the positional encoding module, particularly in models using Rotary Position Embedding (RoPE), where perturbations disrupt dominant-frequency activations critical for contextual processing. Furthermore, we show that perturbing ToM-sensitive parameters affects the LLM's attention mechanism by modulating the angle between queries and keys under positional encoding. These insights provide a deeper understanding of how LLMs acquire social reasoning abilities, bridging AI interpretability with cognitive science. Our results have implications for enhancing model alignment, mitigating biases, and improving AI systems designed for human interaction.
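
As background for the abstract's remark about the angle between queries and keys, the following is a minimal sketch of the standard RoPE rotation, not the paper's perturbation analysis. It assumes the interleaved-pair convention and a base of 10000; the function names `rope_rotate` and `dot` and the toy vectors are hypothetical. It illustrates that the query-key score depends only on the relative position, so anything that changes the rotated vectors' angle changes the attention score.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each consecutive (x, y) pair of `vec` by position * theta_i.

    Assumes the interleaved-pair layout; some implementations split the
    vector into halves instead.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)      # frequency of this pair
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.1]

# Both pairs of positions have the same offset (3), so the scores match.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 13), rope_rotate(k, 10))
```

Because each pair is only rotated, vector norms are unchanged; the positional information lives entirely in the angle between the rotated query and key.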

Paper Structure

This paper contains 63 sections, 10 equations, 22 figures, 6 tables.

Figures (22)

  • Figure 1: A ToM task from Kosinski (2024). In Question (a), LLMs should fill in the blank with "popcorn." In Question (b), the blank should be filled with "chocolate."
  • Figure 2: Illustration of the mask generation method. The diagonal elements $H_{ii}$ are reshaped according to the weight matrix shape to identify sensitive parameters.
  • Figure 3: Activation calculations. (a) Original. We observe dominant-frequency activations introduced by RoPE. (b) Perturbing ToM-sensitive parameters (the squares with red diagonal lines in $\mathbf{W}'$). We observe that the ToM parameter pattern is highly frequency-sensitive and specifically affects dominant-frequency activations.
  • Figure 4: Visualization of the vector relationships between $\mathbf{q}$ and $\mathbf{k}_\text{BOS}$, as well as between $\mathbf{q}$ and other tokens in $\mathbf{K}$, under both positional encoding and ToM perturbation.
  • Figure 5: Attention sink shift. Shifting pure attention sinks introduces incorrect attention relationships, while shifting partial attention sinks distorts the original attention patterns. Attention sink shift degrades the model's language understanding capabilities evaluated by MMLU.
  • ...and 17 more figures
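
The mask generation described in the Figure 2 caption can be illustrated with a rough sketch. It assumes the diagonal $H_{ii}$ is approximated by the mean squared per-sample gradient (a common diagonal Fisher approximation) and that the most sensitive entries are chosen by a top-fraction cutoff. The excerpt does not specify the paper's exact scoring or threshold, so the functions `diagonal_fisher` and `sensitivity_mask` and the toy gradients are hypothetical.

```python
import heapq

def diagonal_fisher(per_sample_grads):
    """Approximate H_ii as the mean squared gradient of parameter i.

    This approximation is an assumption, not the paper's stated method.
    """
    n = len(per_sample_grads)
    size = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(size)]

def sensitivity_mask(diag, shape, fraction):
    """Reshape the diagonal to the weight shape and flag the top `fraction`."""
    rows, cols = shape
    assert rows * cols == len(diag)
    k = max(1, round(fraction * len(diag)))
    top = set(heapq.nlargest(k, range(len(diag)), key=diag.__getitem__))
    flat = [i in top for i in range(len(diag))]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# Toy flattened gradients for a 2x3 weight matrix, two samples (hypothetical).
grads = [
    [0.1, -2.0, 0.0, 0.3, 0.1, -0.2],
    [0.2, -1.0, 0.1, 0.1, 0.0, 0.4],
]
diag = diagonal_fisher(grads)
mask = sensitivity_mask(diag, shape=(2, 3), fraction=1 / 6)
```

Here the second parameter has by far the largest mean squared gradient, so it is the only entry flagged in the 2x3 mask. In a real model the fraction would be far smaller, on the order of the 0.001% mentioned in the abstract.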

Theorems & Definitions (1)

  • Definition 2.1: ToM-sensitive Parameters