SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks

Yilun Zhao; Kaiyan Zhang; Tiansheng Hu; Sihong Wu; Ronan Le Bras; Charles McGrady; Taira Anderson; Jonathan Bragg; Joseph Chee Chang; Jesse Dodge; Matt Latzke; Yixin Liu; Xiangru Tang; Zihang Wang; Chen Zhao; Hannaneh Hajishirzi; Doug Downey; Arman Cohan

TL;DR

SciArena is an open, community-driven platform for evaluating foundation models on non-verifiable scientific literature-grounded tasks. It combines a multi-stage retrieval pipeline with long-form, citation-attributed responses and a human-voted Elo-style leaderboard, backed by 20,832 votes from 102 expert annotators across disciplines. The authors also release SciArena-Eval, a meta-evaluation benchmark that measures how well model-based evaluators align with human judgments; the gaps it reveals motivate more reliable automated evaluation approaches. The work highlights the value of open, consensus-driven evaluation for scientific AI and provides resources for future development.

Abstract

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 47 foundation models and has collected over 20,000 votes from human researchers across diverse scientific domains. Our analysis of the data collected so far confirms its high quality. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on collected preference data. It measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
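
The meta-evaluation idea described above, scoring a model-based judge by how often its pairwise preference matches the human vote, can be illustrated with a minimal sketch. The function name `pairwise_agreement`, the "A"/"B" label encoding, and the toy votes are assumptions made for illustration; they are not taken from SciArena-Eval's actual data format or scoring code.

```python
from typing import Sequence


def pairwise_agreement(human_votes: Sequence[str], judge_votes: Sequence[str]) -> float:
    """Return the fraction of comparisons where the judge picks the human-preferred response."""
    if len(human_votes) != len(judge_votes):
        raise ValueError("vote lists must have the same length")
    if not human_votes:
        raise ValueError("need at least one comparison")
    # Count comparisons where both sides chose the same response.
    matches = sum(h == j for h, j in zip(human_votes, judge_votes))
    return matches / len(human_votes)


# Toy data invented for illustration: "A" or "B" marks the preferred response.
human = ["A", "B", "A", "A"]
judge = ["A", "B", "B", "A"]
accuracy = pairwise_agreement(human, judge)
print(f"judge-human agreement: {accuracy:.2f}")
```

With two balanced options, a judge that guesses at random would land near 0.5 under this measure, which is the natural baseline to compare against.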
