MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting

Avinash Anand, Janak Kapuriya, Apoorv Singh, Jay Saraf, Naman Lal, Astha Verma, Rushali Gupta, Rajiv Shah

TL;DR

This work introduces MM-PhyQA, a high-quality multimodal, high school physics question dataset designed to benchmark open-source large multimodal models on multi-step reasoning tasks. It proposes Multi-Image Chain-of-Thought (MI-CoT) prompting to integrate multiple images per question, and demonstrates that MI-CoT with LLaVA-1.5 13b achieves the strongest performance, reaching $71.65\%$ accuracy on the test set. The study systematically analyzes the impact of modality, fine-tuning, and chain-of-thought prompting, finding that multimodal models with MI-CoT substantially outperform text-only baselines and zero-shot prompting. The results highlight the potential of open-source multimodal models for educational AI and provide a foundation for further enhancements, such as RLHF-based alignment and extending MI-CoT to additional multimodal tasks.

Abstract

While Large Language Models (LLMs) can achieve human-level performance in various tasks, they continue to struggle with multi-step physics reasoning tasks. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, which comprises well-constructed, high school-level multimodal physics problems. By evaluating the performance of contemporary, publicly available LLMs both with and without the multimodal elements of these problems, we aim to shed light on their capabilities. To generate answers for questions with multimodal input (in this case, images and text), we employed zero-shot prediction using GPT-4 and used LLaVA models (LLaVA and LLaVA-1.5) that were fine-tuned on our dataset. To evaluate LLMs on text-only input, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7b models. We also showcase the novel Multi-Image Chain-of-Thought (MI-CoT) prompting technique, which, when used to train LLaVA-1.5 13b, yielded the best results on our dataset, with superior scores in most metrics and the highest accuracy of 71.65% on the test set.

Paper Structure (19 sections, 2 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Schematic Pipeline of Multimodal Question Answering
  • Figure 2: MM-PhyQA dataset questions
  • Figure 3: Multi-Image Chain-of-Thought (MI-CoT) prompted text provided as input to LMMs during training. The main question to be answered is preceded by two exemplars, with the three questions separated by a delimiter. The image is a sequence of three comma-separated file names, and the label is the ground truth
  • Figure 4: Comparison of the accuracy and ROUGE scores of different LLaVA variants when trained using MI-CoT prompting vs. their non-CoT-prompted supervised fine-tuned (SFT) counterparts
  • Figure 5: Types of errors encountered by LLaVA-1.5 13b
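
The input layout described in the Figure 3 caption (two exemplars followed by the main question, joined by a delimiter, with three comma-separated image file names and a ground-truth label) can be sketched in Python as below. This is only an illustration under stated assumptions: the delimiter string, the record keys (`text`, `image`, `label`), the `Question:`/`Answer:` wording, and the helper names are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Assumed delimiter; the source only says the questions are "separated by a delimiter".
DELIMITER = "\n###\n"


@dataclass
class Example:
    question: str
    image_file: str
    reasoning: str = ""  # worked chain-of-thought solution; empty for the main question


def build_mi_cot_record(exemplars, target, label):
    """Assemble one MI-CoT style training record (hypothetical field names)."""
    if len(exemplars) != 2:
        raise ValueError("the described format uses exactly two exemplars")

    # Each exemplar carries its question together with its step-by-step solution.
    parts = [f"Question: {ex.question}\nAnswer: {ex.reasoning}" for ex in exemplars]
    # The main question comes last and is left unanswered in the prompt text.
    parts.append(f"Question: {target.question}\nAnswer:")

    # One image per question, in the same order as the questions in the text.
    images = ",".join(ex.image_file for ex in [*exemplars, target])

    return {"text": DELIMITER.join(parts), "image": images, "label": label}
```

Keeping the image file names in the same order as the questions is a design choice of this sketch, so that each image can be matched to its question by position; the paper's actual pairing mechanism may differ.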