Concept Heterogeneity-aware Representation Steering

Laziz U. Abdullaev, Noelle Y. L. Wong, Ryan T. Z. Lee, Shiqi Jiang, Khoi N. M. Nguyen, Tan M. Nguyen

TL;DR

This work views representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation.

Abstract

Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Across numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.
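The pipeline described in the abstract (cluster-level OT, barycentric projection, kernel-weighted shifts) can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`sinkhorn_plan`, `cluster_shifts`, `steer`), the Gaussian kernel with a `bandwidth` parameter, and the use of entropic (Sinkhorn) OT in place of an exact discrete OT solver are all assumptions made for illustration. Cluster labels are taken as given here, where in practice they might come from k-means.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, reg=0.1, iters=500):
    """Entropic OT plan between cluster weights a and b.
    Assumption: an approximate stand-in for an exact discrete OT solver."""
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

def cluster_shifts(src_centers, tgt_centers, a, b, reg=0.1):
    """One shift per source cluster via barycentric projection of the plan."""
    cost = ((src_centers[:, None] - tgt_centers[None]) ** 2).sum(-1)
    cost = cost / cost.max()  # rescale so exp(-cost / reg) stays well-conditioned
    plan = sinkhorn_plan(a, b, cost, reg)
    # Each source centre maps to the plan-weighted mean of the target centres.
    bary = (plan @ tgt_centers) / plan.sum(1, keepdims=True)
    return bary - src_centers

def steer(x, src_centers, shifts, bandwidth=1.0):
    """Input-dependent steering: softmax (Gaussian-kernel) mix of cluster shifts.
    Assumption: the kernel form and bandwidth are illustrative choices."""
    d2 = ((x[:, None] - src_centers[None]) ** 2).sum(-1)
    logits = -d2 / (2 * bandwidth ** 2)
    w = np.exp(logits - logits.max(1, keepdims=True))
    w /= w.sum(1, keepdims=True)
    return x + w @ shifts

# Toy data: two source clusters whose concept shifts point in opposite directions.
rng = np.random.default_rng(0)
src_true = np.array([[-5.0, 0.0], [5.0, 0.0]])
tgt_true = np.array([[-5.0, 3.0], [5.0, -3.0]])
labels = np.repeat([0, 1], 100)
src = src_true[labels] + 0.3 * rng.normal(size=(200, 2))
tgt = tgt_true[labels] + 0.3 * rng.normal(size=(200, 2))

# Global difference-in-means: the opposing shifts cancel almost entirely.
global_shift = tgt.mean(0) - src.mean(0)

# Cluster-level steering keeps the two shifts separate.
src_centers = np.stack([src[labels == k].mean(0) for k in range(2)])
tgt_centers = np.stack([tgt[labels == k].mean(0) for k in range(2)])
weights = np.array([0.5, 0.5])
shifts = cluster_shifts(src_centers, tgt_centers, weights, weights)
steered = steer(np.array([[-5.0, 0.0], [5.0, 0.0]]), src_centers, shifts)
```

In this toy setup the global direction is close to zero because the two cluster-level shifts cancel, while the kernel-weighted map moves each query point toward its own target cluster. That is the kind of brittleness under heterogeneity that the abstract attributes to a single global direction.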

Paper Structure (33 sections, 3 theorems, 57 equations, 11 figures, 6 tables, 1 algorithm)

Key Result

Theorem 3.1

Let $\mu = \mathcal{N}(\mathbf{m}_1, \mathbf{\Sigma}_1)$ and $\nu = \mathcal{N}(\mathbf{m}_2, \mathbf{\Sigma}_2)$ be two Gaussian distributions on $\mathbb{R}^d$ with positive definite covariance matrices. For the quadratic cost $c(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2$, the foll

Figures (11)

  • Figure 1: PCA (left two) and t-SNE (right two) visualizations of last-token representations for Llama-3.2-3B-Instruct and Qwen2.5-7B-Instruct, colored by k-means clustering. The figures illustrate instances of feasible heterogeneity in concept representations. Appendix \ref{app:semantic_clusters} presents textual examples demonstrating that harmful instructions can be coherently grouped via clustering of their last-token hidden representations.
  • Figure 2: Top: Performance of CHaRS (left) and Linear-AcT (right) in inducing the style cyberpunk, measured by 0-shot classification score and CLIPScore. The green line shows the fraction of generated images classified as cyberpunk. The blue line shows similarity to the original prompt without style modification. Bottom: Pareto fronts illustrating that CHaRS achieves a substantially better trade-off between style induction and content preservation.
  • Figure 3: Images generated using FLUX.1 [Dev] \cite{labs2025flux1kontextflowmatching} intervened with CHaRS for the concept cyberpunk. Top: "A man standing in front of a few horses on the street." Bottom: "Pit bull playing with soccer ball in the grass." A different style and more examples are presented in Appendix \ref{t2i_task_pt2}.
  • Figure 4: Explained variance captured by top-$k$ PCs ($k \in [15]$) for three different models.
  • Figure 5: ASR of CHaRS under ActAdd with respect to the number of clusters $K$. Although ASR shows no clear correlation with $K$, multiple values $K>1$ generally outperform the baseline ($K=1$).
  • ...and 6 more figures

Theorems & Definitions (10)

  • Definition 3.1: CHaRS
  • Remark 3.2
  • Definition 3.3: CHaRS-PCT
  • Theorem 3.1: Gaussian Optimal Transport
  • proof
  • Corollary 3.2: Isotropic Gaussians
  • proof
  • Corollary 3.3: Equal Covariances
  • proof
  • Remark 3.4: Connection to representation steering
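The equal-covariance case listed above (Corollary 3.3) and its connection to steering can be checked numerically. The sketch below uses the standard closed-form OT map between Gaussians under quadratic cost, $T(\mathbf{x}) = \mathbf{m}_2 + \mathbf{A}(\mathbf{x} - \mathbf{m}_1)$ with $\mathbf{A} = \mathbf{\Sigma}_1^{-1/2}(\mathbf{\Sigma}_1^{1/2}\mathbf{\Sigma}_2\mathbf{\Sigma}_1^{1/2})^{1/2}\mathbf{\Sigma}_1^{-1/2}$. This is the textbook form and may differ in notation from the paper's Theorem 3.1, whose statement is truncated in this extract. The helper names `spd_sqrt` and `gaussian_ot_map` are hypothetical.

```python
import numpy as np

def spd_sqrt(S):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T

def gaussian_ot_map(m1, S1, m2, S2):
    """Return (A, T) for the quadratic-cost OT map T(x) = m2 + A (x - m1)."""
    S1_half = spd_sqrt(S1)
    S1_inv_half = np.linalg.inv(S1_half)
    A = S1_inv_half @ spd_sqrt(S1_half @ S2 @ S1_half) @ S1_inv_half
    # A is symmetric, so right-multiplying row vectors by A.T applies A to each row.
    return A, (lambda x: m2 + (x - m1) @ A.T)

rng = np.random.default_rng(0)
d = 3
B = rng.normal(size=(d, d))
S1 = B @ B.T + 3.0 * np.eye(d)   # well-conditioned SPD covariance
C = rng.normal(size=(d, d))
S2 = C @ C.T + np.eye(d)
m1, m2 = rng.normal(size=d), rng.normal(size=d)
x = rng.normal(size=(5, d))

# Equal covariances: A is the identity, so T is the translation x + (m2 - m1),
# i.e. the difference-in-means steering vector.
A_eq, T_eq = gaussian_ot_map(m1, S1, m2, S1)
translation_only = np.allclose(T_eq(x), x + (m2 - m1))

# General case: the linear part pushes the covariance S1 onto S2.
A, T = gaussian_ot_map(m1, S1, m2, S2)
pushes_covariance = np.allclose(A @ S1 @ A.T, S2)
```

When the two covariances coincide, the linear part drops out and only the mean difference remains, which matches the TL;DR's claim that difference-in-means steering corresponds to a global translation.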