Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, Zhun Deng

TL;DR

The paper identifies Rubric-Induced Preference Drift (RIPD), a vulnerability in which natural-language rubric edits that preserve benchmark performance still steer LLM judges toward target-domain preferences that diverge from a fixed reference. It operationalizes a rubric-based preference attack that reduces target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness), and it demonstrates that this drift propagates through Judge→Label→Alignment pipelines to produce persistent policy misalignment. Experiments on multiple datasets and models show that RIPD is robust across judges and transfers to different architectures, underscoring rubric design as a critical, manipulable control interface in evaluation pipelines. The work highlights the need to build rubric refinement and validation into alignment workflows, and the authors release code to support future investigation of, and defense against, such attacks.

Abstract

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show that this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.
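
The condition described in the abstract (an edited rubric that still passes benchmark validation while agreement with a fixed reference drops on a target domain) can be illustrated with a small check. This is a minimal sketch and not the authors' formulation or released code: the function names `agreement` and `drift_report`, the `bench_tolerance` threshold, and the toy labels are all assumptions made for illustration. Judge outputs are represented as a list of preferred responses ("A" or "B") per comparison pair.

```python
from typing import Dict, Sequence


def agreement(judge: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of comparison pairs where the judge picks the reference's choice."""
    if len(judge) != len(reference) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == r for j, r in zip(judge, reference)) / len(judge)


def drift_report(
    ref_bench: Sequence[str],
    ref_target: Sequence[str],
    base_bench: Sequence[str],
    base_target: Sequence[str],
    edit_bench: Sequence[str],
    edit_target: Sequence[str],
    bench_tolerance: float = 0.01,  # hypothetical validation slack, not from the paper
) -> Dict[str, object]:
    """Compare a judge under a baseline rubric and under an edited rubric."""
    bench_base = agreement(base_bench, ref_bench)
    bench_edit = agreement(edit_bench, ref_bench)
    target_base = agreement(base_target, ref_target)
    target_edit = agreement(edit_target, ref_target)

    # The edit "passes validation" if benchmark agreement does not fall
    # by more than the tolerance.
    passes_benchmark = bench_edit >= bench_base - bench_tolerance
    target_delta = target_edit - target_base
    return {
        "benchmark_delta": bench_edit - bench_base,
        "target_delta": target_delta,
        "passes_benchmark": passes_benchmark,
        # Drift is "stealthy" when the benchmark looks fine but the target degrades.
        "stealthy_drift": passes_benchmark and target_delta < 0,
    }


# Toy labels (fabricated for illustration only, not experimental data).
report = drift_report(
    ref_bench=["A", "B", "A", "B"],
    ref_target=["A", "A", "B", "B"],
    base_bench=["A", "B", "A", "A"],
    base_target=["A", "A", "B", "A"],
    edit_bench=["A", "B", "A", "A"],
    edit_target=["B", "A", "A", "A"],
)
print(report)
```

In this toy setup the benchmark agreement is unchanged by the rubric edit, so a benchmark-only validation step would accept it, while target-domain agreement falls. A per-domain comparison against the trusted reference is one way such drift could be surfaced; whether it suffices in practice is beyond what this sketch shows.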

Paper Structure (30 sections, 5 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Rubric-Induced Preference Drift in LLM-Based Judging Pipelines.
  • Figure 2: The adversary is limited to editing the rubrics and can neither access model internals nor observe unseen data. Benchmark and target domains follow identical access protocols.
  • Figure 4: A case study of stealthy rubric-induced preference drift. Despite preserving benchmark compliance, rubric refinements systematically bias judge decisions on target domains, causing downstream policy behaviors to diverge from the intended objective under both helpfulness and harmlessness tasks.
  • Figure : (a) Gemma-2-2B-it
  • ...and 1 more figure