ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

Yicheng Xu, Yue Wu, Jiashuo Yu, Ziang Yan, Tianxiang Jiang, Yinan He, Qingsong Zhao, Kai Chen, Yu Qiao, Limin Wang, Manabu Okumura, Yi Wang

TL;DR

ExpVid addresses a critical gap in evaluating how well multimodal systems understand real laboratory experiments by introducing a vision-grounded benchmark built from JoVE videos and linked peer-reviewed papers. It presents a scalable annotation pipeline and a three-level task hierarchy that spans fine-grained perception, procedural understanding, and scientific reasoning, enabling evaluation across short, intermediate, and long-horizon contexts. Benchmark results across 19 models reveal that while coarse perception is improving, high-level reasoning and long-range integration remain challenging. The results also show a notable gap between open-source and proprietary systems and indicate that visual input is essential for long-video tasks. By diagnosing current capabilities and outlining a roadmap for future development, ExpVid aims to foster trustworthy AI assistants capable of perceiving, verifying, and reasoning about real experiments.

Abstract

Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.

Paper Structure

This paper contains 56 sections, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: Illustration of the three-level task hierarchy in ExpVid.
  • Figure 2: An overview of the ExpVid construction pipeline.
  • Figure 3: Effect of input video frames.
  • Figure 4: Data statistics in ExpVid collection and filtering. (a) Number of experiment videos per discipline before filtering. (b) Video duration distribution with mean 489s and median 505s, showing long-tail outliers beyond 2,000s. (c) Boxplot of video duration by discipline (whiskers at 1.5$\times$IQR). (d) Boxplot of video duration by quality based on the multi-dimensional scoring process.
  • Figure 5: Data statistics of video duration and annotations in ExpVid. (a) Average video/clip duration and standard deviation across the three levels (log scale). (b) Number of annotations for each task. (c) Average number of words per annotation with standard deviation. (d) Average number of annotations per full experimental video across different tasks, with standard deviation.
  • ...and 5 more figures