SCI-Reason: A Dataset with Chain-of-Thought Rationales for Complex Multimodal Reasoning in Academic Areas
Chenghao Ma, Haihong E., Junpeng Ding, Jun Zhang, Ziyan Ma, Huang Qing, Bofei Gao, Liang Chen, Yifan Zhu, Meina Song
TL;DR
SCI-Reason tackles the challenge of robust reasoning over complex academic multimodal imagery by introducing a PubMed-derived dataset of 12,066 images and 12,626 QA pairs, each annotated with a verifiable chain-of-thought rationale generated via Monte Carlo Tree Search. The work evaluates eight foundation models, shows that breakdowns in multi-step inference are the main source of errors, and finds that fine-tuning Qwen2-VL-7B on SCI-Reason data yields substantial gains that generalize to related domains. It also establishes a rigorous evaluation protocol using ACC, ANLS, and WUPS, and highlights the practical value of reasoning traces for tasks such as detecting inconsistencies and potential research fraud. Overall, SCI-Reason provides a principled benchmark and training resource for advancing reliable multimodal reasoning in authentic scientific contexts.
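The summary above names ANLS (Average Normalized Levenshtein Similarity) among the metrics without defining it. The following is a minimal sketch of the metric as it is commonly defined for VQA-style benchmarks, assuming a threshold of 0.5, case-insensitive matching, and a list of acceptable reference answers per question. The function names `levenshtein` and `anls` are illustrative, and the exact normalization and threshold used for SCI-Reason are not stated in this section.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best thresholded similarity to any reference.

    Assumptions (not taken from the paper): threshold of 0.5, lowercasing
    and whitespace stripping before comparison.
    """
    scores = []
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            similarity = 1.0 - nl if nl < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Unlike exact-match accuracy, this score gives partial credit to near-miss answers (for example, a one-character spelling slip) while still assigning zero to answers that differ from every reference by half or more of their length.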
Abstract
Large Language Models (LLMs) and Large Multimodal Models (LMMs) demonstrate impressive problem-solving skills across many tasks and domains. However, their ability to reason over complex images in academic domains has not been systematically investigated. To bridge this gap, we present SCI-Reason, a dataset for complex multimodal reasoning in academic areas. SCI-Reason aims to test and improve the reasoning ability of large multimodal models using real, complex images from academic domains. The dataset contains 12,066 images and 12,626 question-answer pairs extracted from PubMed, divided into training, validation, and test splits. Each question-answer pair also includes an accurate and efficient inference chain that serves as reasoning guidance. With SCI-Reason, we performed a comprehensive evaluation of 8 well-known models. The best-performing model, Claude-3.7-Sonnet, achieved an accuracy of only 55.19%. Error analysis shows that more than half of the model failures are due to breakdowns in multi-step inference chains rather than errors in primary visual feature extraction. This finding underscores the limited reasoning capabilities of current multimodal models when they analyze complex images in authentic academic contexts. Experiments on open-source models show that training on SCI-Reason not only enhances reasoning ability but also yields cross-domain generalization in VQA tasks. We also explore potential applications of model inference capabilities in this domain, highlighting directions for future research.
