"Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)

Pritam Sil, Pushpak Bhattacharyya

TL;DR

This work introduces the Multimodal Short Answer grading with Feedback (MMSAF) problem and a synthetic dataset of 2,197 data points spanning physics, chemistry, and biology, intended to support scalable, feedback-rich assessment of multimodal student responses. It formalizes two core tasks, level of correctness (LC) and image relevance (IR), and adds a feedback-generation component that requires cross-modal reasoning. The authors also propose a data-generation framework and report a baseline evaluation of four multimodal LLMs (ChatGPT, Gemini, Pixtral, Molmo). The results show task- and domain-dependent strengths: Gemini leads on LC, ChatGPT leads on IR, and Pixtral performs strongly in biology according to expert judgments. The study highlights the potential of MMSAF for scalable educational feedback, acknowledges the limitations of synthetic data, and points to future directions such as retrieval-augmented generation to deepen conceptual feedback.

Abstract

Assessments play a vital role in a student's learning process because they provide feedback that is crucial to the student's growth. Such assessments contain questions with open-ended responses, which are difficult to grade at scale. These responses often require students to express their understanding through textual and visual elements together as a unit. Developing scalable assessment tools for such questions requires multimodal LLMs with strong comparative reasoning capabilities across modalities. To facilitate research in this area, we propose the Multimodal Short Answer grading with Feedback (MMSAF) problem along with a dataset of 2,197 data points. Additionally, we provide an automated framework for generating such datasets. In our evaluations, existing Multimodal Large Language Models (MLLMs) predicted whether an answer is correct, incorrect, or partially correct with an accuracy of 55%. Similarly, they predicted whether the image provided in the student's answer is relevant with an accuracy of 75%. According to human experts, Pixtral aligned more closely with human judgement and values in biology, and ChatGPT in physics and chemistry, achieving a score of 4 or more out of 5 on most parameters.
Paper Structure (28 sections, 1 equation, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Illustration of the MMSAF problem with an example. (Image source for heart diagram: https://edurev.in/t/131714/STRUCTURE-OF-HUMAN-HEART)
  • Figure 2: An automatic framework to generate the MMSAF dataset
  • Figure 3: Confusion Matrix for Gemini after True Class Normalization
  • Figure 4: Confusion Matrix for ChatGPT after True Class Normalization
  • Figure 5: Confusion Matrix for Pixtral after True Class Normalization
  • ...and 7 more figures
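
Figures 3 to 5 report confusion matrices "after True Class Normalization". As a minimal sketch of what that normalization usually means, the Python below divides each row of a confusion matrix (one row per true class) by that row's total, so every row sums to 1 and the diagonal reads as per-class recall. The label names follow the three correctness levels named in the abstract; the function name `normalized_confusion` and the toy labels are illustrative assumptions, not the paper's code or data.

```python
from collections import Counter

# The three correctness levels named in the abstract.
LABELS = ["correct", "partially correct", "incorrect"]


def normalized_confusion(y_true, y_pred, labels=LABELS):
    """Return a confusion matrix whose rows (true classes) each sum to 1."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for true_label in labels:
        row = [counts[(true_label, pred_label)] for pred_label in labels]
        total = sum(row)
        # Leave an all-zero row if a class never occurs in y_true.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy labels, invented for illustration only (not results from the paper).
y_true = ["correct", "correct", "partially correct",
          "incorrect", "incorrect", "incorrect", "incorrect"]
y_pred = ["correct", "partially correct", "partially correct",
          "incorrect", "incorrect", "correct", "incorrect"]

cm = normalized_confusion(y_true, y_pred)
for label, row in zip(LABELS, cm):
    print(f"{label:>17}: {row}")
```

Normalizing by the true class makes rows comparable even when the classes are imbalanced, which is why it is a common choice when comparing models on the same label set.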