Challenges in Trustworthy Human Evaluation of Chatbots
Wenting Zhao, Alexander M. Rush, Tanya Goyal
TL;DR
This paper evaluates the trustworthiness of open, community-driven human judgments used to rank LLMs on platforms like Chatbot Arena, where rankings are derived via a Bradley–Terry model using pairwise probabilities $p(m_i>m_j)$. It shows that as few as $10\%$ of apathetic or adversarial votes can shift model rankings by several positions, and that arbitrary votes on subjective prompts introduce further uncertainty. The authors introduce a model-attribution attack and a detector with high true positive/negative rates, and demonstrate a live attack that can bypass guardrails, underscoring systemic vulnerabilities. They conclude with a call for stronger guardrails, richer, multi-dimensional feedback, and open data to improve reliability and enable broader research into evaluation robustness.
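For reference, the standard Bradley–Terry form of the pairwise probability can be written as below. The latent strength parameter $\beta_i$ is notation introduced here for illustration; the paper's own parameterization may differ.

$$
p(m_i > m_j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}
$$

Under this form, the leaderboard order corresponds to sorting models by their fitted $\beta_i$, so any votes that distort the observed win rates also distort the fitted strengths and, potentially, the ranking.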
Abstract
Open community-driven platforms like Chatbot Arena, which collect user preference data from site visitors, have gained a reputation as one of the most trustworthy publicly available benchmarks for LLM performance. Although this approach is now standard, it is difficult to implement effective guardrails for collecting high-quality annotations from humans. In this paper, we demonstrate that three sources of bad annotations, both malicious and otherwise, can corrupt the reliability of open leaderboard rankings. In particular, we show that only 10\% of poor-quality votes by apathetic (site visitors not appropriately incentivized to give correct votes) or adversarial (bad actors seeking to inflate the ranking of a target model) annotators can change the rankings of models by up to 5 places on the leaderboard. Finally, we discuss open challenges in ensuring high-quality human annotations.
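To make the ranking-shift claim concrete, the following is a minimal simulation sketch, not the authors' experimental setup. It assumes a Bradley–Terry leaderboard fitted with the classic minorization-maximization update, synthetic "true" model strengths, and a simplified adversary who votes for a target model whenever it appears in a pair and votes uniformly at random otherwise. All function names, strengths, and vote counts are hypothetical.

```python
import random


def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times model i beat model j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


def simulate_votes(true_strengths, n_votes, rng, bad_fraction=0.0, target=None):
    """Sample pairwise votes; a bad_fraction share of voters is adversarial.

    Assumption: an adversarial voter picks `target` if it is in the pair,
    and otherwise votes uniformly at random.
    """
    n = len(true_strengths)
    wins = [[0] * n for _ in range(n)]
    for _ in range(n_votes):
        i, j = rng.sample(range(n), 2)
        if target is not None and rng.random() < bad_fraction:
            if target in (i, j):
                winner = target
            else:
                winner = rng.choice((i, j))
        else:
            p_i = true_strengths[i] / (true_strengths[i] + true_strengths[j])
            winner = i if rng.random() < p_i else j
        loser = j if winner == i else i
        wins[winner][loser] += 1
    return wins


def ranking(strengths):
    """Model indices ordered from strongest to weakest."""
    return sorted(range(len(strengths)), key=lambda i: -strengths[i])


# Hypothetical leaderboard: six closely matched models, target is the weakest.
true_strengths = [1.30, 1.25, 1.20, 1.15, 1.10, 1.05]
target = 5

clean = simulate_votes(true_strengths, 20000, random.Random(0))
attacked = simulate_votes(
    true_strengths, 20000, random.Random(0), bad_fraction=0.10, target=target
)

print("clean ranking:   ", ranking(fit_bradley_terry(clean)))
print("attacked ranking:", ranking(fit_bradley_terry(attacked)))
```

How far the target moves in this sketch depends heavily on how close the assumed true strengths are: when models are nearly tied, a small share of biased votes is enough to reorder them, whereas well-separated models are much harder to displace.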
