SCI-Reason: A Dataset with Chain-of-Thought Rationales for Complex Multimodal Reasoning in Academic Areas

Chenghao Ma, Haihong E., Junpeng Ding, Jun Zhang, Ziyan Ma, Huang Qing, Bofei Gao, Liang Chen, Yifan Zhu, Meina Song

TL;DR

SCI-Reason tackles the challenge of robust reasoning over complex academic multimodal imagery by introducing a PubMed-derived dataset with 12,066 images and 12,626 QA pairs, each annotated with a verifiable chain-of-thought rationale generated via Monte Carlo Tree Search. The work evaluates eight foundation models, demonstrates that multi-step inference is the main source of errors, and shows that finetuning Qwen2-VL-7B on the SCI-Reason data yields substantial gains and cross-domain generalization. It also establishes an evaluation protocol based on accuracy (ACC), Average Normalized Levenshtein Similarity (ANLS), and Wu-Palmer Similarity (WUPS), and highlights the practical value of reasoning traces for tasks such as detecting inconsistencies and potential research fraud. Overall, SCI-Reason provides a principled benchmark and training resource to advance reliable multimodal reasoning in authentic scientific contexts.

Abstract

Large Language Models (LLMs) and Large Multimodal Models (LMMs) demonstrate impressive problem-solving skills in many tasks and domains. However, their ability to reason with complex images in academic domains has not been systematically investigated. To bridge this gap, we present SCI-Reason, a dataset for complex multimodal reasoning in academic areas. SCI-Reason aims to test and improve the reasoning ability of large multimodal models using real complex images in academic domains. The dataset contains 12,066 images and 12,626 question-answer pairs extracted from PubMed, divided into training, validation, and test splits. Each question-answer pair also contains an accurate and efficient inference chain as a guide to improving the inference properties of the dataset. With SCI-Reason, we performed a comprehensive evaluation of 8 well-known models. The best-performing model, Claude-3.7-Sonnet, achieved an accuracy of only 55.19%. Error analysis shows that more than half of the model failures are due to breakdowns in multi-step inference chains rather than errors in primary visual feature extraction. This finding underscores the inherent limitations in reasoning capabilities exhibited by current multimodal models when processing complex image analysis tasks within authentic academic contexts. Experiments on open-source models show that SCI-Reason not only enhances reasoning ability but also demonstrates cross-domain generalization in VQA tasks. We also explore future applications of model inference capabilities in this domain, highlighting its potential for future research.

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of conventional reasoning tasks and academic complex multimodal reasoning tasks
  • Figure 1: Statistics of Scientific Reasoning Question Types
  • Figure 2: Overview of the dataset construction process: (a) the metadata collection process, (b) the construction of question-answer pairs, and (c) the generation of chain-of-thought annotations
  • Figure 3: Examples of the five classified tasks in our dataset: multimodal temporal reasoning, professional entity location reasoning, cross-subgraph role reasoning, causal mechanism reasoning, and methodological technical reasoning
  • Figure 4: Analysis of model answer errors, categorized into four main types