AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

Tung-Ling Li, Yuhao Wu, Hongliang Liu

TL;DR

This work reveals that binary correctness judgments produced by reward models and LLM-based judges are vulnerable to short, low-perplexity control tokens that steer the last-layer logit gap and flip Yes/No decisions. It introduces AdvJudge-Zero, a zero-seed token discovery method that uses the model’s own next-token distribution and beam search to find diverse, realistic control-token sequences, supported by a geometric view of a low-rank perturbation (soft mode) anti-aligned with the judge’s refusal direction. Across open-weight model families (Qwen, Llama, Gemma) and specialized judges, these tokens yield very high false positive rates on math and reasoning tasks; adversarial training with token-augmented data substantially reduces FPR while preserving true-positive performance. The results emphasize the need for flip-aware defenses and demonstrate that compact, diverse token ensembles can serve as effective stress tests and training signals for more robust post-training pipelines. The findings have practical implications for the safety and reliability of RLHF/DPO/RLAIF workflows and motivate further development of defenses against reward-hacking in LLM evaluation systems.

Abstract

Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
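
To make the "logit gap" and "false positive rate" terms concrete, the following is a minimal sketch and not the paper's implementation. It assumes the judge's decision is read off the sign of the gap between its final-layer Yes and No logits, and it models the effect of control tokens as a constant upward shift of the Yes logit. The function names, the toy logits, and the shift value are all hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def logit_gap(yes_logit: float, no_logit: float) -> float:
    """Gap between the judge's Yes and No logits (assumed decision statistic)."""
    return yes_logit - no_logit


def judge_says_yes(yes_logit: float, no_logit: float) -> bool:
    """The judge answers Yes when the gap is positive (an assumed decision rule)."""
    return logit_gap(yes_logit, no_logit) > 0.0


def false_positive_rate(decisions_on_incorrect: Sequence[bool]) -> float:
    """Fraction of incorrect answers that the judge nevertheless accepts."""
    if not decisions_on_incorrect:
        raise ValueError("need at least one judged example")
    return sum(decisions_on_incorrect) / len(decisions_on_incorrect)


# Made-up (yes_logit, no_logit) pairs for four INCORRECT answers.
clean: Sequence[Tuple[float, float]] = [(-1.0, 2.0), (0.5, 1.5), (-0.2, 0.3), (1.0, 0.4)]

# Hypothetical shift of the Yes logit caused by appended control tokens.
shift = 1.2
attacked = [(yes + shift, no) for yes, no in clean]

clean_fpr = false_positive_rate([judge_says_yes(y, n) for y, n in clean])
attacked_fpr = false_positive_rate([judge_says_yes(y, n) for y, n in attacked])

print(f"FPR without control tokens: {clean_fpr:.2f}")
print(f"FPR with control tokens:    {attacked_fpr:.2f}")
```

In this toy setting the answers themselves never change. Only the gap moves, which is the sense in which a decision is "flipped" without the solution improving.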

Paper Structure

This paper contains 51 sections, 4 equations, 7 figures, 28 tables, 1 algorithm.

Figures (7)

  • Figure 1: Control tokens discovered by AdvJudge-Zero flip the judge’s Yes/No decision on a math solution by steering the last-layer logit gap, without improving the solution itself.
  • Figure 2: Comparison of Ensemble FPR vs. Baseline across four datasets (AIME, MATH, GSM8K, Multi-subject RLVR).
  • Figure 3: FPR vs. token length $n$ for Qwen2.5-7B-Instruct on AIME (left) and Llama-3.3-70B-Instruct on MATH (right).
  • Figure 4: FPR vs. token length $n$ for google/gemma-3-4b-it across datasets. Top left: AIME; top right: GSM8K; bottom left: MATH; bottom right: Multi-subject RLVR.
  • Figure 5: FPR vs. token length $n$ for meta-llama models across datasets. Top left: Llama-3.2-3B-Instruct (AIME); top right: Llama-3.2-3B-Instruct (GSM8K); bottom left: Llama-3.3-70B-Instruct (MATH); bottom right: Llama-3.3-70B-Instruct (Multi-subject RLVR).
  • ...and 2 more figures