SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems

Ziyu Guo, Ray Zhang, Hao Chen, Jialin Gao, Dongzhi Jiang, Jiaze Wang, Pheng-Ann Heng

TL;DR

SciVerse introduces a comprehensive multi-modal benchmark to evaluate large multi-modal models on scientific problems, spanning five problem versions that vary embedded knowledge and the modality of question presentation. A novel scientific CoT evaluation strategy parses intermediate steps into knowledge review and logical deduction to provide fine-grained assessments beyond final accuracy. Across a wide mix of closed- and open-source LMMs, the study finds that closed-source models excel in knowledge grounding and visual perception, but all models are challenged by Vision-only problems and OCR demands, with CoT quality often surpassing final accuracy. The work offers detailed insights into current limitations and lays out a framework for future improvements in scientific knowledge grounding, cross-modal reasoning, and interpretable step-by-step solutions.

Abstract

The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, moving more of the question information from text into diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment of knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: https://sciverse-cuhk.github.io
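The version-wise comparison described in the abstract can be pictured with a small sketch. Only the five version names come from the paper; the `Result` record, its field names, and the `accuracy_by_version` helper are hypothetical and are not taken from the SciVerse codebase. The idea is simply that each problem is answered once per version, and accuracy is then reported per version so that the gaps between versions can be read off.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five version names are taken from the abstract. Everything else in
# this sketch (record layout, helper name) is an illustrative assumption.
VERSIONS = (
    "Knowledge-free",
    "Knowledge-lite",
    "Knowledge-rich",
    "Vision-rich",
    "Vision-only",
)


@dataclass(frozen=True)
class Result:
    """One model answer on one version of one problem (hypothetical layout)."""
    problem_id: str
    version: str
    correct: bool


def accuracy_by_version(results):
    """Return {version: fraction of correct answers} for versions that occur."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        if r.version not in VERSIONS:
            raise ValueError(f"unknown version: {r.version!r}")
        totals[r.version] += 1
        hits[r.version] += int(r.correct)
    # Keep the canonical version order and skip versions with no results.
    return {v: hits[v] / totals[v] for v in VERSIONS if totals[v]}


# Made-up toy results, used only to exercise the helper.
toy_results = [
    Result("p1", "Knowledge-free", False),
    Result("p2", "Knowledge-free", True),
    Result("p1", "Knowledge-rich", True),
    Result("p2", "Knowledge-rich", True),
    Result("p1", "Vision-only", False),
    Result("p2", "Vision-only", False),
]
toy_accuracy = accuracy_by_version(toy_results)
```

In this toy data, the difference between the Knowledge-rich and Knowledge-free entries is the kind of gap the benchmark uses to probe a model's knowledge stock, and the Vision-only entry plays the same role for visual perception.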

Paper Structure

This paper contains 28 sections, 14 figures, 1 table.

Figures (14)

  • Figure 1: Overview of Five Problem Versions and our Scientific CoT Evaluation Strategy in SciVerse. To unveil the scientific knowledge comprehension (Top), we first transform each problem into three versions integrating different levels of expertise knowledge. Then, to examine the multi-modal content interpretation (Middle), we further annotate two problem versions with varying vision-language information. We introduce a specialized scientific evaluation strategy (Bottom) to assess the fine-grained reasoning capabilities of LMMs.
  • Figure 2: Key Statistics of SciVerse.
  • Figure 3: Subject Distribution of SciVerse. The dataset contains 2,010 questions from Physics, 1,880 from Chemistry, and 1,845 from Biology.
  • Figure 4: Examples of Five Problem Versions in SciVerse. For each problem in SciVerse, we first create the Knowledge-free version by removing all knowledge content from the question text. Next, we add knowledge cues and details to produce the Knowledge-lite and Knowledge-rich versions. Additionally, starting from the Knowledge-free version, we generate two more versions, Vision-rich and Vision-only, where the given condition and, ultimately, the entire question are transferred to the visual diagram.
  • Figure 5: Examples of the Scientific CoT Evaluation Strategy. For reasoning responses from LMMs, we prompt GPT-4o to perform two evaluation stages, i.e., step categorization and step-wise evaluation. We categorize the intermediate steps into two types: knowledge review and logical reasoning.
  • ...and 9 more figures
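
The two-stage evaluation described in the Figure 5 caption (step categorization, then step-wise evaluation) can be sketched as follows. This is a minimal illustration, not the paper's scoring rule: the caption does not say how per-step verdicts are aggregated, so the per-type fraction of correct steps below is an assumption, as are the `JudgedStep` record and the `step_scores` helper. The judge call itself (GPT-4o in the paper) is left out; the sketch starts from steps that have already been categorized and judged.

```python
from dataclasses import dataclass

# The two step types come from the Figure 5 caption.
STEP_TYPES = ("knowledge review", "logical reasoning")


@dataclass(frozen=True)
class JudgedStep:
    """One intermediate step after both evaluation stages (hypothetical layout)."""
    text: str
    step_type: str    # stage 1: step categorization
    is_correct: bool  # stage 2: step-wise evaluation


def step_scores(steps):
    """Fraction of correct steps per step type (assumed aggregation).

    A type with no steps maps to None rather than 0.0, so that "no steps of
    this type" is not confused with "every step of this type was wrong".
    """
    for s in steps:
        if s.step_type not in STEP_TYPES:
            raise ValueError(f"unknown step type: {s.step_type!r}")
    scores = {}
    for step_type in STEP_TYPES:
        judged = [s for s in steps if s.step_type == step_type]
        scores[step_type] = (
            sum(s.is_correct for s in judged) / len(judged) if judged else None
        )
    return scores


# Made-up response with three judged steps, used only to exercise the helper.
toy_steps = [
    JudgedStep("Recall Newton's second law, F = ma.", "knowledge review", True),
    JudgedStep("Substitute m = 2 kg and a = 3 m/s^2.", "logical reasoning", True),
    JudgedStep("Conclude F = 5 N.", "logical reasoning", False),
]
toy_scores = step_scores(toy_steps)
```

Keeping the two step types separate is what lets a response with sound knowledge but a faulty deduction be told apart from one that fails at knowledge recall, which final-answer accuracy alone cannot show.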