ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu

TL;DR

ScienceBoard introduces a first-of-its-kind, realistic environment and a 169-task benchmark to evaluate computer-using agents operating across multi-domain scientific workflows. The approach combines GUI and CLI interactions with open-source scientific software inside a VM, and uses a fine-grained, state-based evaluation framework to assess task completion. Across state-of-the-art backbones, agents achieve only about 15% average success, underscoring substantial gaps between current capabilities and autonomous scientific discovery. The work highlights modular planning/grounding, hybrid observation modalities, and domain-knowledge integration as key directions, and points toward collaborative, specialized agents and lab-in-the-loop extensions as promising avenues for impactful AI-driven science.

Abstract

Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

Paper Structure

This paper contains 69 sections, 1 equation, 23 figures, 8 tables.

Figures (23)

  • Figure 1: ScienceBoard is a pioneering computer environment for scientific discovery agents, integrated with a suite of professional software and tools. It serves as an infrastructure enabling computer-using agents to assist in scientific workflows. Based on instructions, agents autonomously interact with the environment via GUI actions or generated code to complete realistic tasks.
  • Figure 2: Overview of the ScienceBoard infrastructure. The scalable environment is built upon a VM pre-installed with scientific discovery software. It supports both CLI and GUI interfaces to enable autonomous agent interaction. For each task designed to evaluate the agent’s capability as a research assistant, an initialization script, configs, and related files are provided. Agents perceive the environment through visual or textual modalities, and are expected to plan and act accordingly. After the interaction, an evaluation function determines completion based on the VM internal states.
  • Figure 2: Statistics of ScienceBoard.
  • Figure 3: The annotation pipeline of the tasks in ScienceBoard benchmark.
  • Figure 4: Distribution of tasks in ScienceBoard benchmark.
  • ...and 18 more figures
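
The Figure 2 caption says that, after the interaction, an evaluation function decides completion from the VM's internal states. The listing above does not give the actual checker interface, so the following is only a minimal sketch of the general idea, assuming the final state can be exported as a flat key/value mapping. The names `Check`, `check_passes`, `evaluate_task`, and `success_rate`, and the example state keys, are all hypothetical and are not taken from the ScienceBoard codebase.

```python
from dataclasses import dataclass
from math import isclose
from typing import Any, Iterable, Mapping


@dataclass(frozen=True)
class Check:
    """One expected value in the final state (hypothetical schema)."""
    key: str
    expected: Any
    tolerance: float = 0.0  # absolute tolerance, used only for numbers


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_passes(check: Check, state: Mapping[str, Any]) -> bool:
    """Compare one expected value against the observed final state."""
    if check.key not in state:
        return False
    actual = state[check.key]
    if _is_number(check.expected) and _is_number(actual):
        return isclose(actual, check.expected, rel_tol=0.0, abs_tol=check.tolerance)
    return actual == check.expected


def evaluate_task(checks: Iterable[Check], state: Mapping[str, Any]) -> bool:
    """A task counts as completed only if every check passes."""
    return all(check_passes(c, state) for c in checks)


def success_rate(results: Iterable[bool]) -> float:
    """Fraction of completed tasks; 0.0 for an empty benchmark."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0


# Hypothetical final state, as if exported from an application in the VM.
final_state = {"zoom_level": 10.02, "layer_visible": True, "projection": "EPSG:4326"}
checks = [
    Check("zoom_level", 10.0, tolerance=0.05),
    Check("layer_visible", True),
    Check("projection", "EPSG:4326"),
]
completed = evaluate_task(checks, final_state)
```

Judging the final state rather than the action trace means an agent can reach the goal through either GUI actions or generated code and be scored the same way, which matches the hybrid interaction described in the caption.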