VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

Zhihuan Jiang, Zhen Yang, Jinhao Chen, Zhengxiao Du, Weihan Wang, Bin Xu, Jie Tang

TL;DR

VisScience fills a critical gap by providing a bilingual, multimodal benchmark that spans mathematics, physics, and chemistry, with 3,000 questions drawn from K12 education across 21 subjects and 5 difficulty levels. It introduces a two-stage data-generation pipeline starting from 450,000 bilingual questions and culminates in a balanced, well-annotated 3,000-question dataset with rich textual and visual contexts. The authors evaluate 25 representative MLLMs, finding that closed-source models generally outperform open-source ones, with top results including 53.4% in mathematics (Claude3.5-Sonnet), 38.2% in physics (GPT-4o), and 47.0% in chemistry (Gemini-1.5-Pro); error analysis shows reasoning as the dominant challenge, especially in interpreting visual information. By comparing VisScience to existing benchmarks and analyzing subject- and language-specific performance, the work demonstrates VisScience’s utility for rigorous measurement of multi-modal scientific reasoning and informs directions for future model improvements and dataset enhancements.

Abstract

Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks by integrating textual and visual information to achieve visual understanding in complex scenarios. Although several benchmarks aim to evaluate MLLMs on tasks ranging from visual question answering to complex problem-solving, most focus predominantly on mathematics or general visual understanding. This reveals a critical gap in current benchmarks, which often overlook other key scientific disciplines such as physics and chemistry. To address this gap, we meticulously construct a comprehensive benchmark, named VisScience, which assesses multi-modal scientific reasoning across the three disciplines of mathematics, physics, and chemistry. This benchmark comprises 3,000 questions drawn from K12 education (elementary school through high school), equally distributed across the three disciplines, with 1,000 questions per discipline. The questions within VisScience span 21 distinct subjects and are categorized into five difficulty levels, offering a broad spectrum of topics within each discipline. With VisScience, we present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. Experimental results demonstrate that closed-source MLLMs generally outperform open-source models. The best performances observed include 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro. These results underscore the strengths and limitations of MLLMs, suggesting areas for future improvement and highlighting the importance of developing models that can effectively handle the diverse demands of multi-modal scientific reasoning.

Paper Structure (30 sections, 70 figures, 8 tables)

Figures (70)

  • Figure 1: The accuracies of representative MLLMs on VisScience across different subjects and difficulty levels. (Left) The accuracies on different subjects. (Right) The accuracies on various difficulty levels.
  • Figure 2: Examples of the VisScience benchmark comprising three disciplines: mathematics, physics, and chemistry.
  • Figure 3: The distribution of detailed subjects and difficulty levels in each discipline within the VisScience benchmark. (Left) The distributions of various subjects. (Right) The distributions of difficulty levels.
  • Figure 4: Error distributions of GPT-4o on VisScience across the disciplines of mathematics, physics, and chemistry.
  • Figure 5: Cases of errors from GPT-4o in the disciplines of mathematics, physics, and chemistry.
  • ...and 65 more figures