Polarity-Aware Probing for Quantifying Latent Alignment in Language Models

Sabrina Sadiekh, Elena Ericheva, Chirag Agarwal

TL;DR

The paper asks whether unsupervised probes can reliably reveal a model's internal alignment by analyzing its latent beliefs under polarity inversion. It extends Contrast-Consistent Search (CCS) to Polarity-Aware CCS (PA-CCS) and introduces two metrics, Polar Consistency and the Contradiction Index, to quantify how consistently a model encodes harmful vs. safe statements across polarity inversions. In experiments on three datasets and 16 transformer models, PA-CCS uncovers architecture- and scale-dependent polarity signals, with instruction-tuned models showing more stable, coherent internal representations. The work demonstrates the value of unsupervised, polarity-aware probes for latent alignment assessment and advocates robustness checks to distinguish genuine polarity understanding from surface cues.

Abstract

Advances in unsupervised probes such as Contrast-Consistent Search (CCS), which reveal latent beliefs without relying on token outputs, raise the question of whether these methods can reliably assess model alignment. We investigate this by examining the sensitivity of CCS to harmful vs. safe statements and by introducing Polarity-Aware CCS (PA-CCS), a method for evaluating whether a model's internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, to quantify the semantic robustness of a model's latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset containing matched harmful-safe sentence pairs constructed using different methodologies (concurrent and antagonistic statements). We apply PA-CCS to 16 language models. Our results show that PA-CCS identifies both architectural and layer-specific differences in the encoding of latent harmful knowledge. Notably, replacing the negation token with a meaningless marker degrades PA-CCS scores for models with well-aligned internal representations, while models lacking robust internal calibration do not exhibit this degradation. Our findings highlight the potential of unsupervised probing for alignment evaluation and emphasize the need to incorporate structural robustness checks into interpretability benchmarks. Code and datasets are available at: https://github.com/SadSabrina/polarity-probing. WARNING: This paper contains potentially sensitive, harmful, and offensive content.

Paper Structure

This paper contains 19 sections, 4 equations, 34 figures, 5 tables.

Figures (34)

  • Figure 1: Overview of the PA-CCS framework. A) The process begins with a set of matched sentence pairs $S^{\{\text{safe,harm}\}}$. B) These pairs are transformed into contrastive inputs $A^{\{+, -\}}, \bar{A}^{\{+, -\}}$ via basic CCS suffixes and C) passed through all layers of a frozen language model. For each layer, hidden representations $x^{\{+, -\}}, \bar{x}^{\{+, -\}}$ of both statements are extracted. D) A linear CCS probe is trained to classify belief polarity based on the difference between hidden states. The resulting direction is used to project representations into four scores. E) These scores are then used to compute two alignment-sensitive metrics: Polar Consistency (PC) $\in [-1, 1]$ and Contradiction Index (CI) $\in [0, 2]$. The entire process is repeated across all layers. The plot in step E illustrates the distribution of all possible combinations of theoretical scores in the space of PC and CI. Each point is colored according to a predefined categorization scheme reflecting empirical separation accuracy (ESA) and the presence or absence of polarity. Regions with strong separation between safe and harmful statements ($ESA \geq 0.75$) cluster near low PC and moderate CI, inverted regions have negative PC, and non-polarized cases have either elevated CI or both PC and CI low ($\in (0.05, 0.25)$).
  • Figure 2: Mean accuracy, polar consistency, and contradiction index for the base experiments (orange and blue lines) and the control experiment with a random polarity token (ttt, green line), with 95% confidence intervals, across all layers of large models (bottom, 239 layers) and small models (top, 204 layers) with accuracy $\geq 0.625$. The PA-CCS metrics improve significantly, indicating a gain in polarity alignment.
  • Figure 3: Comparison of PA-CCS metrics between encoder and decoder models across datasets. Encoders exhibit lower variance, while medians remain similar. The same trend is observed when comparing only the encoder and decoder parts of the encoder-decoder models (bert2BERT).
  • Figure 4: Trade-off between PC and CI metrics on mixed and not datasets for large models (guard, instruct, and vanilla variants of Llama-8B; instruct and vanilla variants of Gemma 2B and 9B). Median values that allow achieving separation accuracy $\geq 0.75$ are 0.055 (PC) and 0.410 (CI).
  • Figure 5: Impact of instruction and alignment tuning (dark blue) on PA-CCS. For layers of large models (top; 157 layers of vanilla models vs. 190 layers of finetuned models), instruction-tuned variants demonstrate higher alignment accuracy, lower contradiction, and more consistent polarity behavior. For smaller models (bottom), task-specific pretraining (114 layers for each dataset of non-pretrained models, 90 layers for each dataset of pretrained models) leads to similar improvements. Finetuning systematically reduces variance and enhances model robustness.
  • ...and 29 more figures
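
The Figure 1 caption describes extracting hidden states for contrastive inputs and training a linear CCS probe on them (steps C and D). As a rough illustration of that step only, the sketch below implements the standard CCS objective from the original CCS work (a consistency term plus a confidence term) in NumPy. It is not the authors' PA-CCS code: the function names, the per-class normalization helper, and the toy 4-dimensional "hidden states" are assumptions made for illustration. The Polar Consistency and Contradiction Index computations are left out because their definitions are not given in this summary.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping probe logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def normalize(x):
    """Per-class normalization as used in standard CCS (assumed here):
    subtract the mean and divide by the std over the batch dimension."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)


def ccs_loss(theta, bias, x_pos, x_neg):
    """Standard CCS objective for a linear probe (hypothetical helper).

    x_pos, x_neg: arrays of shape (n_pairs, hidden_dim) holding hidden
    states for the two halves of each contrast pair.
    """
    p_pos = sigmoid(x_pos @ theta + bias)
    p_neg = sigmoid(x_neg @ theta + bias)
    # Consistency: p(x+) should equal 1 - p(x-).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


# Toy hidden states (assumption): the pair differs along the first axis.
x_pos = np.array([[10.0, 0.0, 0.0, 0.0]])
x_neg = np.array([[-10.0, 0.0, 0.0, 0.0]])

uninformative = ccs_loss(np.zeros(4), 0.0, x_pos, x_neg)
aligned = ccs_loss(np.array([1.0, 0.0, 0.0, 0.0]), 0.0, x_pos, x_neg)
```

In this toy setup the all-zero probe outputs 0.5 for both inputs and is penalized only by the confidence term, while the probe aligned with the separating axis drives both terms close to zero. In practice the probe parameters would be fitted by gradient descent on this loss, separately for each layer, after applying `normalize` to each set of hidden states.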