Large Language Models Encode Semantics and Alignment in Linearly Separable Representations

Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi

TL;DR

The paper investigates how large language models encode semantics and alignment in hidden representations, revealing that high-level semantic content lies in low-dimensional subspaces that form linearly separable clusters, especially in deeper layers and under prompts requiring structured reasoning. Using 11 decoder-only models across six domains and employing hard-margin SVMs to assess linear separability, the study shows that semantic structure persists even when domain keywords are masked, and that alignment cues (instruction following, safety) produce distinct, separable geometric patterns in latent space. Building on this geometry, the authors demonstrate a practical latent-space guardrail—a lightweight MLP trained on final-layer representations—that significantly improves refusal rates for harmful or adversarial prompts with minimal impact on benign utility. They also provide causal evidence that simple steering along centroid-difference vectors between topic clusters can nudge model behavior toward CoT-style reasoning. Overall, the work suggests transformer hidden spaces carry interpretable fingerprints of semantics and alignment, offering a promising direction for latent-space safety tools without modifying model weights.
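
The centroid-difference steering mentioned above can be illustrated with a minimal sketch. It assumes hidden states are available as plain lists of floats; the function names and the scaling factor `alpha` are hypothetical, and this summary does not say at which layer or with what scale the authors apply the vector.

```python
def centroid(vectors):
    """Mean vector of a cluster of equal-length hidden states."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(source_cluster, target_cluster):
    """Direction from the source centroid to the target centroid."""
    mu_source = centroid(source_cluster)
    mu_target = centroid(target_cluster)
    return [t - s for s, t in zip(mu_source, mu_target)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the direction; alpha is a hypothetical scale."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-D clusters standing in for two groups of hidden states.
source = [[0.0, 0.0], [2.0, 0.0]]   # centroid (1, 0)
target = [[0.0, 4.0], [2.0, 4.0]]   # centroid (1, 4)
direction = steering_vector(source, target)
shifted = steer([1.0, 0.0], direction, alpha=0.5)
```

With `alpha=0.5` the toy state moves halfway from the source centroid to the target centroid. In a real model the shifted vector would have to be written back into the forward pass, which this sketch does not attempt.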

Abstract

Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. Yet it remains unclear to what extent LLMs linearly organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across six scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior, even when surface content remains unchanged. These findings motivate geometry-aware tools that operate directly in latent space to detect and mitigate harmful and adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model's built-in safety alignment and external token-level filters.
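
To make "linearly separable" concrete, here is a minimal sketch of a pairwise separability check on synthetic data. It is not the authors' pipeline: the TL;DR says they use hard-margin SVMs, whereas this sketch swaps in a plain perceptron, which reaches zero training errors exactly when a separating hyperplane exists (given enough epochs). The toy clusters, the helper names, and the epoch cap are all assumptions. It also reports the fraction of separable pairs, not the averaged classification accuracy shown in the paper's figures.

```python
import itertools
import random


def perceptron_separates(xs, ys, max_epochs=1000):
    """Return True if a perceptron reaches zero training errors (labels in {-1, +1})."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # no separating hyperplane found within the epoch cap


def make_toy_clusters(n_topics=6, dim=8, n_points=20, seed=0):
    """Synthetic stand-in for per-topic hidden states: one noisy cluster per topic."""
    rng = random.Random(seed)
    clusters = []
    for topic in range(n_topics):
        center = [5.0 if d == topic else 0.0 for d in range(dim)]
        clusters.append(
            [[c + rng.gauss(0.0, 0.3) for c in center] for _ in range(n_points)]
        )
    return clusters


def pairwise_separable_fraction(clusters):
    """Check every unordered topic pair; return (fraction separable, number of pairs)."""
    pairs = list(itertools.combinations(range(len(clusters)), 2))
    hits = 0
    for i, j in pairs:
        xs = clusters[i] + clusters[j]
        ys = [1] * len(clusters[i]) + [-1] * len(clusters[j])
        hits += perceptron_separates(xs, ys)
    return hits / len(pairs), len(pairs)


fraction, n_pairs = pairwise_separable_fraction(make_toy_clusters())
```

Six topics yield 15 unordered pairs, which matches the "15 pairwise accuracies" averaged in the figure captions. A `False` result only means no separator was found within `max_epochs`, so on real data a hard-margin SVM solver is the more reliable check.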

Paper Structure

This paper contains 73 sections, 8 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: SVM classification accuracy on representations of scientific abstracts as a function of layer depth. Results are averaged over 15 pairwise accuracies. Darker colors represent the larger model within each model family. The sub-1.0 average accuracy reflects that most topic pairs are separable, while a few are not—resulting in high but not perfect accuracy.
  • Figure 2: SVM classification accuracy on representations of masked scientific abstracts as a function of the keyword-masking threshold. Each point is the average over 15 pairwise accuracies. Results are shown for the final layers of Mistral-24B (dark green) and Llama 3.1-8B (dark orange).
  • Figure 3: SVM classification accuracy on the representations of the same prompt with and without a one-sentence chain-of-thought instruction. Results are averaged over individual accuracies from questions in CommonsenseQA, GSM8K, and MMLU. Darker colors indicate the larger model within each model family.
  • Figure 4: Conceptual illustration of hidden representations showing clustering patterns across four prompt types. Cluster positions are based on Wasserstein distances, with cluster sizes reflecting variance. Dashed lines indicate linear decision boundaries.
  • Figure 5: Refusal rates across evaluation datasets for responses generated by a Llama 3.1-8B Instruct model. A paired McNemar test ($p<0.05$) confirms that our latent-space guardrail significantly alters prompt handling—achieving higher refusal rates on harmful inputs and prompt injections compared to the baselines.
  • ...and 9 more figures
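
The paired McNemar test cited in the Figure 5 caption compares two systems on the same prompts by looking only at the prompts where they disagree. The sketch below uses the exact binomial form of the test; whether the authors used this form or the chi-square approximation is not stated here, so treat that choice as an assumption. The refusal outcomes are made-up toy data, not results from the paper.

```python
from math import comb


def discordant_counts(baseline_refused, guardrail_refused):
    """Count prompts where exactly one of the two systems refused."""
    paired = list(zip(baseline_refused, guardrail_refused))
    only_guardrail = sum(1 for base, guard in paired if guard and not base)
    only_baseline = sum(1 for base, guard in paired if base and not guard)
    return only_guardrail, only_baseline


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical paired outcomes on 15 harmful prompts (True = refused).
baseline = [False] * 10 + [True] * 5
guardrail = [True] * 15
b, c = discordant_counts(baseline, guardrail)
p_value = mcnemar_exact(b, c)  # 2 / 1024, about 0.00195
```

Prompts on which both systems agree do not affect the statistic. Here all ten disagreements favor the guardrail, which gives a p-value below the 0.05 threshold named in the caption.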