SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks

Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Charles McGrady, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan

TL;DR

SciArena introduces an open, community-driven platform to evaluate foundation models on non-verifiable scientific literature-grounded tasks. It combines a multi-stage retrieval pipeline with long-form, citation-attributed responses and a human-voted Elo-style leaderboard, backed by 20,832 votes from 102 expert annotators across disciplines. It also releases SciArena-Eval, a meta-evaluation benchmark that measures how well model-based evaluators align with human judgments, revealing gaps that motivate more reliable automated evaluation approaches. The work highlights the value of open, consensus-driven evaluation for science AI and provides resources for future development.

Abstract

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 47 foundation models and has collected over 20,000 votes from human researchers across diverse scientific domains. Our analysis of the data collected so far confirms its high quality. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on collected preference data. It measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
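
The SciArena-Eval measurement described above, comparing an evaluator's pairwise assessments with human votes, can be illustrated with a minimal sketch. The function name, the "A"/"B" label encoding, and the toy data are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from typing import Sequence


def pairwise_agreement(judge_prefs: Sequence[str], human_votes: Sequence[str]) -> float:
    """Return the fraction of comparisons where the judge picks the same
    response as the human voter (hypothetical helper, not the paper's code)."""
    if len(judge_prefs) != len(human_votes):
        raise ValueError("judge_prefs and human_votes must have the same length")
    if not human_votes:
        raise ValueError("at least one comparison is required")
    # Count the comparisons on which the two preference labels coincide.
    matches = sum(j == h for j, h in zip(judge_prefs, human_votes))
    return matches / len(human_votes)


# Toy data (assumed): each entry is the preferred response, "A" or "B".
human = ["A", "B", "A", "A"]
judge = ["A", "A", "A", "B"]
accuracy = pairwise_agreement(judge, human)
print(accuracy)  # 0.5
```

Under this reading, a model-based evaluator that always matched the human vote would score 1.0. How ties or other vote types are handled is not stated in this section, so the sketch leaves them out.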

Paper Structure

This paper contains 64 sections, 6 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: SciArena focuses on evaluating foundation models on scientific literature tasks. It consists of three main components: (1) a platform that collects human researcher preference votes between foundation models; (2) a leaderboard that ranks models using an Elo rating system based on these votes; and (3) the SciArena-Eval benchmark for assessing model-based evaluation systems.
  • Figure 2: An overview of the SciArena interface pipeline.
  • Figure 3: Statistics of the initial human preference data collected through SciArena, including voting information and distribution across question categories and scientific disciplines.
  • Figure 4: The prompt used for model response generation.
  • Figure 5: The prompt used for model response postprocessing using the GPT-4.1 model. Citations are then matched to the reference list using rule-based methods, and the finalized response is displayed in the SciArena interface.
  • ...and 8 more figures
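
Figure 1 refers to a leaderboard that ranks models with an Elo rating system based on preference votes. As a rough illustration, the sketch below applies the classic online Elo update to one vote. The K-factor of 32, the initial rating of 1000, and the model names are assumptions; this section does not give the paper's actual rating procedure, which may differ.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: Dict[str, float], winner: str, loser: str,
               k: float = 32.0, initial: float = 1000.0) -> None:
    """Update ratings in place after one vote where `winner` was preferred.

    k and initial are assumed values, not taken from the paper.
    """
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    # The winner gains in proportion to how unexpected the win was;
    # the loser gives up the same number of points.
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta


# Hypothetical model names; both start at the assumed initial rating.
ratings: Dict[str, float] = {}
update_elo(ratings, winner="model_x", loser="model_y")
print(ratings)  # {'model_x': 1016.0, 'model_y': 984.0}
```

With equal starting ratings the expected score is 0.5, so a single vote moves each model by half the K-factor.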