Subliminal Signals in Preference Labels

Isotta Magistrali; Frédéric Berdoz; Sam Dauncey; Roger Wattenhofer

TL;DR

We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through preference assignments, and these traits even strengthen across iterative alignment rounds.

Abstract

As AI systems approach superhuman capabilities, scalable oversight increasingly relies on LLM-as-a-judge frameworks where models evaluate and guide each other's training. A core assumption is that binary preference labels provide only semantic supervision about response quality. We challenge this assumption by demonstrating that preference labels can function as a covert communication channel. We show that even when a neutral student model generates semantically unbiased completions, a biased judge can transmit unintended behavioral traits through preference assignments, which even strengthen across iterative alignment rounds. Our findings suggest that robust oversight in superalignment settings requires mechanisms that can detect and mitigate subliminal preference transmission, particularly when judges may pursue unintended objectives.

Paper Structure (25 sections, 4 equations, 4 figures, 7 tables)

Introduction
Related Work
LLMs as Judges.
Subliminal and Covert Learning in Language Models.
Methodology
Prompt Generation and Completions.
Preference Dataset Construction.
Alignment.
Iterative alignment.
Evaluation.
Experiments
Experimental Setup
Evaluation Metrics for Subliminal Transfer
Conclusion
Pipeline Steps Details
...and 10 more sections

Figures (4)

Figure 1: Overview of our experimental framework. A neutral student model generates multiple candidate completions for a given prompt. A biased judge model evaluates them to construct a preference dataset, which is then used to align the student model. Unlike prior subliminal learning studies [cloud2025subliminal], where the biased model itself generates the training data and encodes hundreds of bits per sample, here the student produces unbiased numerical sequences while the bias originates solely from an external judge's binary preference labels, which transmit only one bit per sample.
Figure 2: Consistency of preference divergence. (a) We evaluate the student models of [cloud2025subliminal] through our framework. No swapped version exists, so we consider the difference between biased and control models as the total effect size. We observe that consistency breaks down for lion. (b) Our variant of the judging procedure with a generic prompt ("Produce numbers"): the target animal always exhibits consistency of preference divergence.
Figure 3: Overview of alignment strategies. For each preference pair (Prompt, C+, C-), we evaluate four training configurations: (Top) SFT on preferred completions (C+) and SFT on dispreferred completions (C-) (swapped condition). (Bottom) Direct DPO using C+ as chosen and C- as rejected (normal), and DPO with reversed labels (swapped). The normal and swapped configurations allow us to verify that observed effects are attributable to the directional preference signal rather than artifacts of the alignment procedure. DPO can also be performed sequentially after SFT to enhance training stability.
Figure 4: Iterative alignment procedure. Aligned models from the first iteration serve as student models to generate new completions, which are judged to create updated preference datasets for a second round of alignment. The process maintains configuration consistency (normal-to-normal, swapped-to-swapped), allowing us to track how preference signal transmission evolves across successive alignment iterations. The figure shows the process for SFT, but the same procedure can be applied to DPO.
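The four training configurations described in Figure 3 can be sketched as plain data transformations over judged preference pairs. This is a minimal illustration, not the authors' implementation: the names `PreferencePair`, `label_pairs`, `build_configs`, and `toy_judge` are hypothetical, and the toy judge's digit-sum rule is a placeholder standing in for a biased LLM judge that returns a single binary preference per pair.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # C+, the completion the judge preferred
    rejected: str  # C-, the completion the judge did not prefer


# Hypothetical judge interface: returns True if the first completion is preferred.
Judge = Callable[[str, str, str], bool]


def toy_judge(prompt: str, first: str, second: str) -> bool:
    """Placeholder judge (an assumption): prefers the larger digit sum."""
    def digit_sum(text: str) -> int:
        return sum(int(ch) for ch in text if ch.isdigit())
    return digit_sum(first) >= digit_sum(second)


def label_pairs(prompt: str,
                candidates: List[Tuple[str, str]],
                judge: Judge) -> List[PreferencePair]:
    """Turn candidate pairs into preference pairs using one bit per pair."""
    pairs = []
    for first, second in candidates:
        if judge(prompt, first, second):
            pairs.append(PreferencePair(prompt, first, second))
        else:
            pairs.append(PreferencePair(prompt, second, first))
    return pairs


def build_configs(pairs: List[PreferencePair]) -> Dict[str, list]:
    """Build the normal and swapped datasets for SFT and DPO."""
    return {
        # SFT trains on a single completion per prompt.
        "sft_normal": [(p.prompt, p.chosen) for p in pairs],
        "sft_swapped": [(p.prompt, p.rejected) for p in pairs],
        # DPO trains on (chosen, rejected); swapped reverses the labels.
        "dpo_normal": [
            {"prompt": p.prompt, "chosen": p.chosen, "rejected": p.rejected}
            for p in pairs
        ],
        "dpo_swapped": [
            {"prompt": p.prompt, "chosen": p.rejected, "rejected": p.chosen}
            for p in pairs
        ],
    }


pairs = label_pairs("Produce numbers", [("1, 2", "7, 9")], toy_judge)
configs = build_configs(pairs)
```

Because the swapped datasets are built from the same completions with only the label direction reversed, any difference between a normal and a swapped model can be attributed to the preference signal rather than to the completions themselves. For the iterative procedure in Figure 4, the same two functions would be applied again to completions generated by the aligned model, keeping normal-to-normal and swapped-to-swapped.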
