SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang

TL;DR

SciVideoBench introduces the first scientific video reasoning benchmark that requires deep domain knowledge to interpret real experimental content. It draws on 241 JoVE videos across Physics, Chemistry, Biology, and Medicine and yields 1,000 multiple-choice questions grounded in aligned video, audio narration, and research papers. A semi-automatic, agent-assisted QA workflow ensures robust visual grounding and rigorous verification. Evaluation across 21 LMMs shows pronounced gaps: Quantitative Reasoning is the hardest category, and chain-of-thought prompting delivers substantial performance gains, highlighting the need for improved visual grounding and numeric reasoning. The benchmark aims to accelerate progress toward AI systems capable of supporting real-world scientific work.

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition and involve relatively simple reasoning tasks, which leads to saturation and a failure to evaluate advanced multimodal cognitive skills effectively. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning more than 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will serve the interests of the community and help push the boundary of cutting-edge AI for broader science.

Paper Structure

This paper contains 33 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: SciVideoBench features research-level experimental videos accompanied by challenging questions that rigorously evaluate advanced video understanding. It emphasizes the synergistic interaction among accurate visual perception, expert knowledge, and sophisticated logical reasoning.
  • Figure 2: Overview of our annotation pipeline. We manually annotate QA pairs for three example videos using both the video and the associated paper. A multi-agent LLM system generates and refines QA pairs: the QA Generator produces initial questions, the Evaluator answers them with reasoning, the Visual Comparer checks for visual grounding and timestamps cues, and the Refiner ensures questions rely on video content and improves option quality. Human experts verify and refine the final QA pairs. Audio transcripts are omitted for simplicity.
  • Figure 3: Discipline and subject distribution in SciVideoBench. Our benchmark covers four major scientific disciplines—Biology, Chemistry, Medicine, and Physics—encompassing more than 25 specialized subjects. This diverse coverage ensures a comprehensive evaluation across a wide range of scientific domains.
  • Figure 4: Dataset statistics in SciVideoBench: (a) video duration, (b) question length, and (c) option length distributions. These statistics provide a comprehensive overview of the dataset’s temporal scale and linguistic diversity.
  • Figure 5: Examples of SciVideoBench, including videos from 4 disciplines (Physics, Biology, Chemistry, and Medicine), which involve 19 different subjects. The research-level QAs challenge LMMs in three different aspects (Conceptual, Hypothetical, and Quantitative) that are of vital importance in scientific experiment video understanding.
  • ...and 10 more figures
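
The annotation pipeline described in the Figure 2 caption can be pictured as a chain of stages that each transform a draft QA pair, followed by a human check. The sketch below is only an illustration of that control flow under stated assumptions: `QAPair`, `run_pipeline`, and the three `stub_*` functions are hypothetical names introduced here, the stubs stand in for the LLM agents (Evaluator, Visual Comparer, Refiner), and the draft passed in plays the role of the QA Generator's output. It is not the authors' implementation.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class QAPair:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: List[str]
    answer_index: int
    rationale: str = ""                                        # filled by the evaluator stage
    cue_timestamps: List[float] = field(default_factory=list)  # filled by the visual comparer
    visually_grounded: bool = False


# A stage takes a QA pair and returns an updated copy.
Stage = Callable[[QAPair], QAPair]


def run_pipeline(
    draft: QAPair,
    stages: List[Stage],
    human_verify: Callable[[QAPair], bool],
) -> Optional[QAPair]:
    """Apply each stage in order, then keep the result only if a human accepts it."""
    qa = draft
    for stage in stages:
        qa = stage(qa)
    return qa if human_verify(qa) else None


# Placeholder stages (assumptions): real ones would call an LLM with video, audio, and paper context.
def stub_evaluator(qa: QAPair) -> QAPair:
    return replace(qa, rationale="placeholder reasoning")


def stub_visual_comparer(qa: QAPair) -> QAPair:
    return replace(qa, cue_timestamps=[12.5], visually_grounded=True)


def stub_refiner(qa: QAPair) -> QAPair:
    # Stand-in for option cleanup: here it only trims whitespace.
    return replace(qa, options=[option.strip() for option in qa.options])


if __name__ == "__main__":
    draft = QAPair(question="Which step precedes centrifugation?", options=[" A ", "B"], answer_index=0)
    final = run_pipeline(
        draft,
        [stub_evaluator, stub_visual_comparer, stub_refiner],
        human_verify=lambda qa: qa.visually_grounded,
    )
    print(final)
```

Modeling each agent as a plain callable keeps the ordering explicit and lets any stage be swapped out or tested on its own; the final human check is what decides whether a QA pair is kept.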