Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang

TL;DR

This work demonstrates that voting-based LLM leaderboards such as Chatbot Arena can be adversarially manipulated by de-anonymizing model responses and then voting in a targeted way. It introduces two detector paradigms (identity probing, and training-based detectors built on BoW/TF-IDF features) that identify a target model with over 95% accuracy, which lets an attacker shift rankings with roughly a thousand adversarial votes in a simulated, offline version of Chatbot Arena. Through simulations and cost modeling, the authors quantify the resources needed to shift rankings and propose mitigations (authentication, rate limiting, malicious-user detection, and higher action costs) that substantially raise attack costs. The study provides a practical security framework for defensible, interactive evaluation platforms and highlights the need for robust, verifiable human-in-the-loop assessment in AI evaluation ecosystems.

Abstract

It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95\%$ accuracy; then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increase the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login, are being integrated to strengthen the security of Chatbot Arena.
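The effect of biased votes on an Elo-style leaderboard can be illustrated with a small offline simulation. The Python sketch below is not the authors' simulator: the model names, latent strengths, K-factor, and the `biased_fraction` parameter are all hypothetical, and it assumes the standard Elo update rule and a detector that never misidentifies the target (the paper reports more than 95% accuracy, not 100%).

```python
import random

def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B (logistic, base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, a, b, a_wins, k=4.0):
    """Apply one zero-sum Elo update in place. k is an assumed K-factor."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta

def simulate(true_strength, n_votes, biased_fraction, target, seed=0):
    """Replay n_votes pairwise battles and return the final ratings.

    true_strength: hypothetical latent ratings that drive honest votes.
    biased_fraction: assumed share of battles involving `target` in which
    the voter always picks `target`, regardless of response quality.
    """
    rng = random.Random(seed)
    models = list(true_strength)
    ratings = {m: 1000.0 for m in models}
    for _ in range(n_votes):
        a, b = rng.sample(models, 2)      # two distinct anonymous models
        u_bias, u_vote = rng.random(), rng.random()
        if target in (a, b) and u_bias < biased_fraction:
            a_wins = (a == target)        # biased vote for the target
        else:
            # Honest vote: winner drawn from the latent strengths.
            a_wins = u_vote < expected_score(true_strength[a], true_strength[b])
        update(ratings, a, b, a_wins)
    return ratings

# Hypothetical three-model leaderboard; "target" is the weakest model.
strengths = {"model_a": 1050.0, "model_b": 1000.0, "target": 950.0}
honest = simulate(strengths, 20000, 0.0, "target")
biased = simulate(strengths, 20000, 0.5, "target")
print({m: round(r) for m, r in honest.items()})
print({m: round(r) for m, r in biased.items()})
```

Comparing the two printed leaderboards shows how far the target's rating moves when a share of its battles is decided by biased votes. The numbers depend entirely on the assumed parameters and say nothing about the vote counts reported in the paper.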

Paper Structure (32 sections, 7 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Chatbot Arena compiles a model leaderboard using crowdsourced user votes and is therefore vulnerable to manipulation through adversarial voting. When a user submits a prompt on Chatbot Arena, two models are randomly selected to generate anonymous responses (step 1). Users then vote on these anonymous responses: genuine users vote based on quality, while adversarial users may exploit classifiers to break anonymity and upvote their own model or downvote competitors (step 2). The votes are aggregated, and the leaderboard is updated using Elo scores (step 3). As a result, adversarial voting can distort the model rankings.
  • Figure 2: First two principal components of bag-of-words ($\mathsf{BoW}$) features for model responses to three randomly selected English prompts (provided in \ref{app:vis}). Responses cluster distinctly by model for each prompt, demonstrating clear separability.
  • Figure 3: Test accuracy (%) of detectors trained to distinguish the target model (specified in each column) from other models (scale: 85% to 100%). Prompts featuring domain-specific tasks (e.g., "Math", "Coding", and "Safety-violating") and non-English languages (e.g., Spanish) yield the highest detection accuracy. Detectors are built using $\mathsf{BoW}$ features.
  • Figure 4: Scenario 1: The defender uses the likelihood to identify malicious users. Against a naive adversary who chooses randomly between untargeted models, this approach can be effective; however, an adversary who uses the existing public ranking can bypass detection.
  • Figure 5: Scenario 2: The defender releases a perturbed version of the leaderboard. Even when an adversary uses this perturbed leaderboard to choose between two untargeted models, their actions can still be detected. Increasing the amount of noise helps in detecting malicious users.
  • ...and 1 more figure
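The likelihood-based defense named in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's detection statistic: it assumes votes are scored by their mean log-likelihood under Elo win probabilities computed from the public ratings, and the ratings, user names, vote lists, and threshold are all hypothetical.

```python
import math

def win_prob(r_winner, r_loser):
    """Elo-style probability that the first model beats the second."""
    return 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))

def mean_log_likelihood(votes, ratings):
    """Average log-probability of a user's votes under the public ratings.

    votes: list of (winner, loser) model-name pairs cast by one user.
    """
    total = sum(math.log(win_prob(ratings[w], ratings[l])) for w, l in votes)
    return total / len(votes)

def flag_users(user_votes, ratings, threshold):
    """Return users whose votes are less likely than `threshold` (assumed cutoff)."""
    return sorted(
        user for user, votes in user_votes.items()
        if votes and mean_log_likelihood(votes, ratings) < threshold
    )

# Hypothetical public ratings and vote histories.
ratings = {"A": 1200.0, "B": 1000.0, "C": 900.0}
user_votes = {
    # Mostly agrees with the ranking, with one upset.
    "honest": [("A", "B"), ("A", "C"), ("B", "C"), ("B", "A")],
    # Always votes for the lowest-rated model C.
    "adversary": [("C", "A"), ("C", "B"), ("C", "A")],
}
print(flag_users(user_votes, ratings, threshold=-1.0))
```

In this toy setup only the user who always favors the lowest-rated model falls below the cutoff. As the caption notes, an adversary who follows the public ranking when choosing between untargeted models produces more plausible-looking votes, which is the weakness Scenario 2 addresses.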