Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du

TL;DR

This work addresses the instability of single concept vectors in LLM representations by introducing Gaussian Concept Subspace (GCS), which describes each concept as a distribution over multiple vectors learned via random probing subsets. GCS demonstrates faithfulness (high intra-set similarity, strong cross-set similarity) and plausibility (category-aware clustering and PCA-supported hierarchies) across diverse models, enabling robust inference-time interventions. Empirically, sampled GCS vectors achieve comparable or superior predictive accuracy to observed vectors and effectively steer outputs (e.g., toward joyful movie reviews) while preserving fluency. The approach offers a principled, scalable framework for concept-aware interpretability and controllable generation in LLMs with potential applications in alignment and safety.

Abstract

Probing learned concepts in large language models (LLMs) is crucial for understanding how semantic knowledge is encoded internally. Training linear classifiers on probing tasks is a principal approach to identifying the vector of a certain concept in the representation space. However, the single vector identified for a concept varies with both data and training, making it less robust and weakening its effectiveness in real-world applications. To address this challenge, we propose an approach to approximate the subspace representing a specific concept. Built on linear probing classifiers, we extend the concept vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS's effectiveness by measuring its faithfulness and plausibility across multiple LLMs with different sizes and architectures. Additionally, we use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering. Experimental results indicate that GCS concept vectors have the potential to balance steering performance with fluency in natural language generation tasks.
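
The idea summarized above (several linear probes trained on random subsets, then a Gaussian fitted over the resulting vectors) can be illustrated with a minimal sketch. It is not the authors' implementation. The synthetic "hidden states", the plain gradient-descent probe, the per-dimension (diagonal) Gaussian, and all names and hyperparameters such as `train_probe`, `n_probes`, and `subset_size` are assumptions made for illustration only.

```python
import numpy as np

def train_probe(X, y, lr=0.1, steps=200):
    """Fit a logistic-regression probe by gradient descent; return its unit-norm weights."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of the concept
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient of the mean log-loss w.r.t. w
        b -= lr * np.mean(p - y)                 # gradient w.r.t. the bias
    return w / np.linalg.norm(w)

def observed_concept_vectors(X, y, n_probes, subset_size, rng):
    """Train one probe per random subset; each weight vector is an 'observed' concept vector."""
    vecs = []
    for _ in range(n_probes):
        idx = rng.choice(len(y), size=subset_size, replace=False)
        vecs.append(train_probe(X[idx], y[idx]))
    return np.stack(vecs)

def fit_gaussian(vectors):
    """Per-dimension mean and standard deviation (diagonal covariance is an assumption here)."""
    return vectors.mean(axis=0), vectors.std(axis=0)

def sample_concept_vectors(mu, sigma, n_samples, rng):
    """Draw unit-norm concept vectors from the fitted Gaussian."""
    samples = mu + sigma * rng.standard_normal((n_samples, mu.shape[0]))
    return samples / np.linalg.norm(samples, axis=1, keepdims=True)

def mean_cosine(A, B):
    """Average pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A @ B.T).mean())

# Synthetic stand-in for LLM hidden states: two classes separated along one direction.
rng = np.random.default_rng(0)
dim, n = 16, 400
direction = np.ones(dim) / np.sqrt(dim)          # hypothetical ground-truth concept direction
y = np.tile([0.0, 1.0], n // 2)                  # balanced binary labels
X = rng.standard_normal((n, dim)) + np.outer(2 * y - 1, direction) * 2.0

observed = observed_concept_vectors(X, y, n_probes=20, subset_size=200, rng=rng)
mu, sigma = fit_gaussian(observed)
sampled = sample_concept_vectors(mu, sigma, n_samples=20, rng=rng)

# Similarity between observed and sampled vectors is one rough faithfulness check.
print("observed vs sampled:", mean_cosine(observed, sampled))
```

In this toy setting the cross-set similarity only shows that sampled vectors stay close to the observed ones. With real models, `X` would hold hidden representations from a chosen layer and `y` the concept labels.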

Paper Structure (74 sections, 10 equations, 11 figures, 1 table, 1 algorithm)

Figures (11)

  • Figure 1: Hierarchical concepts.
  • Figure 2: Histogram of cosine similarity within observed concept vectors, sampled concept vectors, and between both sets for concept "Bird" (a-c). Layer-wise average cosine similarity from the second layer to the penultimate layer of observed concept vectors, sampled concept vectors, and between both sets for concept "Bird" (d-f).
  • Figure 3: Accuracy of observed and sampled concept vectors across varying models.
  • Figure 4: Heatmap of concept average cosine similarity of 16 concepts across Llama-2-7B, Gemma-7B, and Llama-2-13B. The 16 low-level concepts are grouped into four high-level categories: the first 4 rows/columns represent sports events, the next 4 represent populated places, followed by 4 for animals, and the last 4 for movie genres.
  • Figure 5: PCA visualization of 16 concepts across Llama-2-7B, Gemma-7B, and Llama-2-13B. Low-level concepts belonging to the same high-level concept category share the same color.
  • ...and 6 more figures