Scientific Reasoning: Assessment of Multimodal Generative LLMs
Florian Dreyer, Ekaterina Kolos, Daria Matiash
TL;DR
The paper evaluates multimodal large language models on ScienceQA to understand scientific reasoning across text and visuals. It systematically compares six frontier LLMs, two smaller models adapter-tuned via Prefix Tuning and LoRA, and knowledge-distilled students guided by the best-performing teacher. Gemini models achieve the highest accuracy with limited context and produce explanations with the highest textual similarity to human ones when given richer context. Adapter-tuning yields no reliable gains, and distillation lags behind training on the original data. The study highlights the importance of input context and data quality, as well as the current limitations of adapter-based fine-tuning for multimodal scientific reasoning, with practical implications for deploying efficient yet capable models. Suggested future work includes refining knowledge distillation (KD) strategies and exploring alternative adapters and multi-head architectures to improve reasoning and explanation capabilities.
Abstract
Large language models (LLMs) can answer questions and reason about complex tasks, including tasks from the scientific domain. We assess several multimodal LLMs (MLLMs) on ScienceQA and find that Gemini models show the highest accuracy with little context, and the highest textual similarity to human explanations with richer context. Adapter-tuning of smaller MLLMs did not yield reliable performance. Training on Gemini outputs consistently underperformed training on the original data.
