LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder

Yi Jing, Zijun Yao, Hongzhu Guo, Lingxu Ran, Xiaozhi Wang, Lei Hou, Juanzi Li

TL;DR

LinguaLens tackles the opacity of linguistic representations in large language models by introducing a sparse auto-encoder–based framework that decomposes hidden states into interpretable linguistic features across morphology, syntax, semantics, and pragmatics in both English and Chinese. It builds a large-scale counterfactual linguistic dataset and defines causal metrics, enabling feature extraction and intervention that validate representations and support controllable outputs. Experiments on the Llama-3.1-8B model show systematic discovery of features, cross-layer organization, and cross-lingual overlap, providing evidence of genuine linguistic knowledge in LLMs. The work offers a comprehensive toolkit and dataset to advance interpretable and controllable language modeling for future research.

Abstract

Large language models (LLMs) demonstrate exceptional performance on tasks requiring complex linguistic abilities, such as reference disambiguation and metaphor recognition/generation. Although LLMs possess impressive capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Prior research on linguistic mechanisms is limited by coarse granularity, limited analysis scale, and narrow focus. In this study, we propose LinguaLens, a systematic and comprehensive framework for analyzing the linguistic mechanisms of large language models, based on Sparse Auto-Encoders (SAEs). We extract a broad set of Chinese and English linguistic features across four dimensions (morphology, syntax, semantics, and pragmatics). By employing counterfactual methods, we construct a large-scale counterfactual dataset of linguistic features for mechanism analysis. Our findings reveal intrinsic representations of linguistic knowledge in LLMs, uncover patterns of cross-layer and cross-lingual distribution, and demonstrate the potential to control model outputs. This work provides a systematic suite of resources and methods for studying linguistic mechanisms, offers strong evidence that LLMs possess genuine linguistic knowledge, and lays the foundation for more interpretable and controllable language modeling in future research.
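To make the SAE-based decomposition concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard ReLU sparse auto-encoder with hypothetical weights `w_enc`, `b_enc`, `w_dec`, `b_dec`, and it ranks features by a simple mean-activation gap between sentences that contain a linguistic phenomenon and their counterfactual rewrites. That gap is an illustrative stand-in for the paper's causal metrics, which are not specified in this summary.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Map a hidden state to non-negative feature activations: ReLU(W_enc x + b_enc)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def sae_decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct the hidden state as b_dec plus an activation-weighted sum of
    basis vectors (here, the rows of w_dec)."""
    out = list(b_dec)
    for act, basis in zip(f, w_dec):
        for i, v in enumerate(basis):
            out[i] += act * v
    return out


def mean_activation_gap(
    pos: List[Vector], neg: List[Vector], w_enc: Matrix, b_enc: Vector
) -> Vector:
    """Per-feature mean activation on sentences with the phenomenon (pos)
    minus the mean on their counterfactual rewrites (neg).
    Assumed scoring rule for illustration only."""

    def mean_acts(states: List[Vector]) -> Vector:
        acts = [sae_encode(s, w_enc, b_enc) for s in states]
        return [sum(col) / len(acts) for col in zip(*acts)]

    return [p - n for p, n in zip(mean_acts(pos), mean_acts(neg))]


# Toy weights: 3 features over a 2-dimensional hidden state (hypothetical values).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_DEC = [0.0, 0.0]

acts = sae_encode([2.0, -1.0], W_ENC, B_ENC)
recon = sae_decode(acts, W_DEC, B_DEC)
gap = mean_activation_gap(
    pos=[[2.0, 0.0], [4.0, 0.0]],
    neg=[[0.0, 1.0], [0.0, 3.0]],
    w_enc=W_ENC,
    b_enc=B_ENC,
)
```

In this toy setup, a feature with a large positive gap would be a candidate for the phenomenon that distinguishes the two sentence sets; in practice the hidden states would come from a model layer and the weights from a trained SAE.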

Paper Structure

This paper contains 208 sections, 13 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: The main linguistic features activated at each layer when example sentences are fed to the model. A Sparse Auto-Encoder maps each layer’s activation values into a sparse space, from which the basis vectors corresponding to predefined linguistic features are extracted. Based on the results, the model’s 32 layers are divided into four stages, in order: Morphology and Core Syntax, Complex Syntactic Constructions, Pragmatic Functions, and Deep Semantics and Rhetoric.
  • Figure 2: The overall framework of LinguaLens, which analyzes the linguistic mechanisms of large language models across four dimensions of theoretical linguistics and includes a cross-lingual analysis of Chinese and English. The experimental workflow is: (1) construct counterfactual datasets; (2) extract features by analyzing the activation values of basis vectors on the datasets; (3) intervene in the model output by modifying activation values and assess causality using an LLM as a judge.
  • Figure 3: Heatmap of the overlap between Chinese and English feature sets across the SAE basis vectors at each of the 32 layers. The horizontal axis groups Chinese and English features with analogous form and function, ordered by morphology, syntax, semantics, and pragmatics, while the vertical axis indexes the model layers. Darker red indicates greater overlap.
  • Figure 4: Distributions of activation values for the features corresponding to deep semantics at layers 6 and 15, for example sentences involving reference ambiguity and metaphor.
  • Figure 5: Combined intervention results. Two panels separately present the enhancement and ablation outcomes for the simile and politeness features at layer 26. In these experiments, the multiple basis vectors corresponding to each feature were intervened on jointly.
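
The intervention step in Figures 2 and 5 and the overlap measure in Figure 3 can be illustrated with a short sketch. Everything here is an assumption for illustration: `intervene` rescales the activations of a chosen set of basis vectors (a scale of 0 ablates, a scale above 1 enhances), and `jaccard_overlap` uses the Jaccard index as one plausible way to score the overlap between Chinese and English feature sets. The captions do not state the paper's exact intervention rule or overlap formula.

```python
from typing import Iterable, List


def intervene(acts: List[float], feature_ids: Iterable[int], scale: float) -> List[float]:
    """Jointly rescale the activations of the selected basis vectors.

    scale == 0.0 ablates the features; scale > 1.0 enhances them.
    Multiplicative scaling is an assumed rule; note that it leaves
    features with zero activation unchanged.
    """
    ids = set(feature_ids)
    return [a * scale if i in ids else a for i, a in enumerate(acts)]


def jaccard_overlap(zh_features: Iterable[int], en_features: Iterable[int]) -> float:
    """Jaccard index between two sets of basis-vector indices (assumed overlap score)."""
    zh, en = set(zh_features), set(en_features)
    union = zh | en
    return len(zh & en) / len(union) if union else 0.0


# Hypothetical activations at one layer and hypothetical feature indices.
layer_acts = [1.0, 2.0, 0.5]
ablated = intervene(layer_acts, feature_ids=[0, 2], scale=0.0)
enhanced = intervene(layer_acts, feature_ids=[1], scale=3.0)
overlap = jaccard_overlap([1, 2, 3], [2, 3, 4])
```

In a full pipeline, the modified activations would be decoded back into the hidden state before the forward pass continues, and the resulting generations would then be scored by the LLM judge described in Figure 2.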