Fooling LLM graders into giving better grades through neural activity guided adversarial prompting

Atsushi Yamamura, Surya Ganguli

TL;DR

The paper tackles biases in LLM-based evaluators by proposing a systematic method to uncover hidden neural representations that predict scoring. It first identifies a cognitive state in the model via linear readouts trained on residual activations and then crafts adversarial suffixes with a gradient-based search to amplify that state, producing high scores in automated grading. The method demonstrates strong cross-model transfer and reveals a consistent 'magic word' bias tied to chat templates, which can be mitigated by a simple change in the prompting template. Overall, the work highlights the need for bias-aware design and template-aware defenses to improve the safety, fairness, and robustness of LLM-powered evaluation systems.

Abstract

The deployment of artificial intelligence (AI) in critical decision-making and evaluation processes raises concerns about inherent biases that malicious actors could exploit to distort decision outcomes. We propose a systematic method to reveal such biases in AI evaluation systems and apply it to automated essay grading as an example. Our approach first identifies hidden neural activity patterns that predict distorted decision outcomes and then optimizes an adversarial input suffix to amplify such patterns. We demonstrate that this combination can effectively fool large language model (LLM) graders into assigning much higher grades than humans would. We further show that this white-box attack transfers to black-box attacks on other models, including commercial closed-source models like Gemini. These attacks further reveal the existence of a "magic word" that plays a pivotal role in the efficacy of the attack. We trace the origin of this magic word bias to the structure of commonly used chat templates for supervised fine-tuning of LLMs and show that a minor change in the template can drastically reduce the bias. This work not only uncovers vulnerabilities in current LLMs but also proposes a systematic method to identify and remove hidden biases, contributing to the goal of ensuring AI safety and security.
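
As a rough illustration of the first step, the sketch below shows how a linear readout could turn a hidden activation vector into a distribution over candidate scores, and how that distribution yields a probability-weighted average score. It is a minimal sketch under stated assumptions, not the authors' implementation: the softmax over readout logits, the bias term, the function names, and all toy numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def readout_distribution(weights, bias, activation):
    """Apply a linear readout: one logit per candidate score, then softmax.

    weights: one weight vector per candidate score (hypothetical values)
    bias: one bias per candidate score (assumed; the source does not say)
    activation: hidden activation vector at one layer and token position
    """
    logits = [
        sum(w * a for w, a in zip(row, activation)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def expected_score(probs, scores):
    """Average score weighted by the predicted score distribution."""
    return sum(p * s for p, s in zip(probs, scores))


# Toy example: three candidate scores and a 2-dimensional activation.
scores = [1, 2, 3]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
activation = [0.0, 0.0]

probs = readout_distribution(weights, bias, activation)
print(expected_score(probs, scores))
```

Because the result is an average over a distribution, it can be non-integer even when every individual grade is an integer.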

Paper Structure

This paper contains 15 sections, 2 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Illustration of systematic bias injection into machine evaluators. (a) We first train a linear readout to identify internal activation patterns that can predict the model's final evaluation. (b) We then optimize adversarial input suffixes to amplify internal activation patterns that predict high scores. Such suffixes can reveal subtle LLM biases that can be exploited to distort decision outcomes.
  • Figure 2: LLMs decide scores internally, much earlier than their explicit output. (a) An illustration of how the scores are obtained. The linear readout predicts the final score distribution from the activation pattern in the residual stream of a given layer at a given token position. In particular, we consider the readout at the end of the student essay and at the end of the entire input. (b,c) Comparison of averaged scores given by Llama3-8B-Instruct and the trained linear readout prediction at the $16$th layer out of $32$ layers in the model. Each blue dot represents a held-out essay, which is not used to train the linear readout weights. Note that while individual essay scores are integers, the scores displayed here are average scores weighted by the respective output distributions, and hence can be non-integer.
  • Figure 3: The readout weight vectors associated with the highest score largely overlap across different essay problem sets and prompt templates. For each fixed essay problem set and prompt template, we obtain the readout weight at the end of the essay input at layer $16$ of the Llama3.1-8B-Instruct model, and compute the cosine similarity between different readouts. Although the highest scores of problems #1 and #2 differ, their weights are well aligned. We take the average of these four weight vectors, which we interpret as a cognitive state corresponding to the highest score in a universal context.
  • Figure 4: LLM-graded scores are elevated by the optimized adversarial suffixes in Table \ref{table:adv_suffixes} with Llama3.1-8B-Instruct. In scatter plots, each point corresponds to a held-out student essay that is not used in our adversarial suffix optimization.
  • Figure 5: Our adversarial suffix is effective in attacking different language models. We measure the effectiveness of suffix #1 in Table \ref{table:adv_suffixes} in attacking various language models. The essay problem sets (#3 and #4) and the prompt templates (#3 and #4) used here are different from those used for adversarial suffix optimization. In scatter plots, each point corresponds to a held-out essay. We show additional results with other models in Figure \ref{fig:scores_llms_additional} in the Appendix.
  • ...and 6 more figures
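
The Figure 3 caption describes comparing readout weight vectors by cosine similarity and then averaging four of them into a single direction. The sketch below illustrates that computation on toy data. It is only an assumption-laden illustration: the dictionary keys, the vector values, and the choice of a plain unnormalized mean are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical top-score readout weights, keyed by (problem set, template).
readouts = {
    ("problem_1", "template_1"): [0.9, 0.1, 0.2],
    ("problem_1", "template_2"): [0.8, 0.2, 0.1],
    ("problem_2", "template_1"): [1.0, 0.0, 0.3],
    ("problem_2", "template_2"): [0.7, 0.1, 0.2],
}

# Pairwise similarities between readouts from different settings.
keys = list(readouts)
for i, first in enumerate(keys):
    for second in keys[i + 1:]:
        sim = cosine_similarity(readouts[first], readouts[second])
        print(first, second, round(sim, 3))

# Averaged direction, standing in for the shared "highest score" state.
shared_direction = mean_vector(list(readouts.values()))
print(shared_direction)
```

High pairwise similarities in such a comparison would indicate that the separately trained readouts point in nearly the same direction, which is what makes averaging them meaningful.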