Scientific Reasoning: Assessment of Multimodal Generative LLMs

Florian Dreyer, Ekaterina Kolos, Daria Matiash

TL;DR

The paper evaluates multimodal large language models on ScienceQA to assess scientific reasoning across text and images. It compares six frontier LLMs, two smaller models adapter-tuned with Prefix Tuning and LoRA, and knowledge-distilled student models trained on the outputs of the best-performing teacher. Gemini models achieve the highest accuracy with limited context and the highest textual similarity to human explanations when given richer context, while adapter-tuning yields no reliable gains and distillation lags behind training on the original data. The study highlights the importance of input context and data quality, as well as the current limitations of adapter-based fine-tuning for multimodal scientific reasoning, with practical implications for deploying efficient yet capable models. The authors suggest future work on refining knowledge distillation (KD) strategies and exploring alternative adapters and multi-head architectures to improve reasoning and explanation capabilities.

Abstract

Large language models (LLMs) can answer questions and reason about complex tasks, including tasks from the scientific domain. We assess several multimodal LLMs (MLLMs) on ScienceQA and find that Gemini models show the highest accuracy with little context, and the highest textual similarity to human explanations with richer context. Adapter-tuning of smaller MLLMs did not yield reliable performance. Training on Gemini outputs consistently underperformed training on the original data.

Paper Structure

This paper contains 41 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Intuition behind Prefix Tuning. Source: li2021prefix
  • Figure 2: Accuracy scores in answer generation by LLMs. Benchmarking.
  • Figure 3: Overall scores in reasoning by LLMs. Benchmarking.
  • Figure 4: Overall score in reasoning by LLMs in base form, after Prefix Tuning, and after LoRA adapter-tuning.
  • Figure 5: Overall score in reasoning by LLMs.