Concept Heterogeneity-aware Representation Steering

Laziz U. Abdullaev; Noelle Y. L. Wong; Ryan T. Z. Lee; Shiqi Jiang; Khoi N. M. Nguyen; Tan M. Nguyen

TL;DR

This work views representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation.

Abstract

Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.
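The contrast between a single global direction and a kernel-weighted combination of cluster-level shifts can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Gaussian kernel with bandwidth `sigma`, the scale `alpha`, and the choice to pass the transport plan in as a precomputed matrix are all assumptions made for illustration. In practice the plan would come from a discrete OT solver run on the cluster centroids.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def diff_in_means_steer(x, source, target, alpha=1.0):
    """Global steering: shift x by alpha * (mean(target) - mean(source))."""
    m_src, m_tgt = mean(source), mean(target)
    return [xi + alpha * (t - s) for xi, s, t in zip(x, m_src, m_tgt)]

def cluster_shifts(src_centroids, tgt_centroids, plan):
    """Barycentric projection of a transport plan: one shift per source cluster.

    plan[i][j] is the mass moved from source cluster i to target cluster j
    (assumed given; every row must carry positive mass).
    """
    shifts = []
    for a, row in zip(src_centroids, plan):
        mass = sum(row)
        bary = [sum(p * b[d] for p, b in zip(row, tgt_centroids)) / mass
                for d in range(len(a))]
        shifts.append([bd - ad for bd, ad in zip(bary, a)])
    return shifts

def kernel_weights(x, src_centroids, sigma=1.0):
    """Normalized Gaussian-kernel weights of x relative to each source centroid."""
    sq = [sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for a in src_centroids]
    lo = min(sq)  # subtract the minimum so the largest exponent is 0
    w = [math.exp(-(s - lo) / (2 * sigma ** 2)) for s in sq]
    z = sum(w)
    return [wi / z for wi in w]

def cluster_aware_steer(x, src_centroids, tgt_centroids, plan, sigma=1.0, alpha=1.0):
    """Input-dependent steering: x plus a kernel-weighted mix of cluster shifts."""
    shifts = cluster_shifts(src_centroids, tgt_centroids, plan)
    w = kernel_weights(x, src_centroids, sigma)
    return [xi + alpha * sum(wk * s[d] for wk, s in zip(w, shifts))
            for d, xi in enumerate(x)]

# Toy data: with a single cluster the cluster-aware map is a global translation.
source = [[0.0, 0.0], [2.0, 0.0]]
target = [[0.0, 3.0], [2.0, 3.0]]
x = [1.0, 1.0]
global_out = diff_in_means_steer(x, source, target)
one_cluster_out = cluster_aware_steer(x, [mean(source)], [mean(target)], [[1.0]])
```

With one source and one target cluster, the cluster-aware map coincides with the difference-in-means translation. With several well-separated clusters, an input near a given source centroid is moved almost entirely by that cluster's shift.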

Paper Structure (33 sections, 3 theorems, 57 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
Background
Optimal Transport
Representation Steering as OT Between Gaussian Distributions
Extension to OT Between GMMs
Gaussian Mixture Wasserstein Distance
Concept Heterogeneity-aware Representation Steering via GMM-OT
Probabilistic Modeling of the Transport Map
Clustering-based Representation Steering
Principal Component Thresholding for Transport-aligned Steering Vectors
Controlling the Steering Effect
Jailbreaking Large Language Models
Toxicity Mitigation
Image Generation Style Control
Additional Empirical and Ablation Studies
...and 18 more sections

Key Result

Theorem 3.1

Let $\mu = \mathcal{N}(\mathbf{m}_1, \mathbf{\Sigma}_1)$ and $\nu = \mathcal{N}(\mathbf{m}_2, \mathbf{\Sigma}_2)$ be two Gaussian distributions on $\mathbb{R}^d$ with positive definite covariance matrices. For the quadratic cost $c(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2$, the foll
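The statement is cut off in this copy. For reference only, the standard closed-form result for Gaussians under quadratic cost, as it appears in the OT literature, is sketched below. The notation $T$, $\mathbf{A}$, and $W_2$ is assumed here and may differ from the paper's own statement. The last two lines work out the equal-covariance case, which is the link to difference-in-means steering described in the abstract.

```latex
% Standard Gaussian OT map and cost (assumed notation, not copied from the paper)
T(\mathbf{x}) = \mathbf{m}_2 + \mathbf{A}\,(\mathbf{x} - \mathbf{m}_1),
\qquad
\mathbf{A} = \mathbf{\Sigma}_1^{-1/2}
  \bigl(\mathbf{\Sigma}_1^{1/2}\,\mathbf{\Sigma}_2\,\mathbf{\Sigma}_1^{1/2}\bigr)^{1/2}
  \mathbf{\Sigma}_1^{-1/2}

W_2^2(\mu, \nu) = \|\mathbf{m}_1 - \mathbf{m}_2\|_2^2
  + \operatorname{tr}\Bigl(\mathbf{\Sigma}_1 + \mathbf{\Sigma}_2
  - 2\bigl(\mathbf{\Sigma}_1^{1/2}\,\mathbf{\Sigma}_2\,\mathbf{\Sigma}_1^{1/2}\bigr)^{1/2}\Bigr)

% Equal covariances: \Sigma_1 = \Sigma_2 = \Sigma
\mathbf{A} = \mathbf{\Sigma}^{-1/2}\bigl(\mathbf{\Sigma}^{2}\bigr)^{1/2}\mathbf{\Sigma}^{-1/2}
           = \mathbf{\Sigma}^{-1/2}\,\mathbf{\Sigma}\,\mathbf{\Sigma}^{-1/2} = \mathbf{I}

T(\mathbf{x}) = \mathbf{x} + (\mathbf{m}_2 - \mathbf{m}_1)
```

In the equal-covariance case the map is a translation by the difference of the two means, independent of the input $\mathbf{x}$.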

Figures (11)

Figure 1: PCA (left two) and t-SNE (right two) visualizations of last-token representations for Llama-3.2-3B-Instruct and Qwen2.5-7B-Instruct, colored by k-means clustering. The figures illustrate instances of feasible heterogeneity in concept representations. Appendix \ref{app:semantic_clusters} presents textual examples demonstrating that harmful instructions can be coherently grouped via clustering of their last-token hidden representations.
Figure 2: Top: Performance of CHaRS (left) and Linear-AcT (right) in inducing the style cyberpunk, measured by 0-shot classification score and CLIPScore. The green line shows the fraction of generated images classified as cyberpunk. The blue line shows similarity to the original prompt without style modification. Bottom: Pareto fronts illustrating that CHaRS achieves a substantially better trade-off between style induction and content preservation.
Figure 3: Images generated using FLUX.1 [Dev] \cite{labs2025flux1kontextflowmatching} intervened with CHaRS for the concept cyberpunk. Top: "A man standing in front of a few horses on the street." Bottom: "Pit bull playing with soccer ball in the grass." A different style and more examples are presented in Appendix \ref{t2i_task_pt2}.
Figure 4: Explained variance captured by top-$k$ PCs ($k \in [15]$) for three different models.
Figure 5: ASR of CHaRS under ActAdd with respect to the number of clusters $K$. Although there is no clear correlation between ASR and $K$, multiple values of $K>1$ generally outperform the baseline ($K=1$).
...and 6 more figures

Theorems & Definitions (10)

Definition 3.1: CHaRS
Remark 3.2
Definition 3.3: CHaRS-PCT
Theorem 3.1: Gaussian Optimal Transport
proof
Corollary 3.2: Isotropic Gaussians
proof
Corollary 3.3: Equal Covariances
proof
Remark 3.4: Connection to representation steering
