"Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)

Pritam Sil; Pushpak Bhattacharyya

TL;DR

This work introduces the Multimodal Short Answer grading with Feedback (MMSAF) problem and a synthetic dataset of 2,197 data points spanning physics, chemistry, and biology, aimed at scalable, feedback-rich assessment of multimodal student responses. It formalizes classification of the Level of Correctness (LC) and Image Relevance (IR) as core tasks and adds a feedback-generation component that requires cross-modal reasoning. The authors propose a data-generation framework and report a baseline evaluation of four LLMs (ChatGPT, Gemini, Pixtral, Molmo) that reveals domain-dependent strengths: Gemini leads on LC, ChatGPT leads on IR, and Pixtral performs strongly in biology according to expert judgments. The study highlights the potential of MMSAF for scalable educational feedback, acknowledges the limitations of synthetic data, and points to future directions such as retrieval-augmented generation to deepen conceptual feedback.

Abstract

Assessments play a vital role in a student's learning process because they provide feedback that is crucial to the student's growth. Such assessments contain questions with open-ended responses, which are difficult to grade at scale. These responses often require students to express their understanding through textual and visual elements together as a unit. Developing scalable assessment tools for such questions requires multimodal LLMs with strong comparative reasoning capabilities across modalities. To facilitate research in this area, we propose the Multimodal Short Answer grading with Feedback (MMSAF) problem along with a dataset of 2,197 data points. Additionally, we provide an automated framework for generating such datasets. In our evaluations, existing Multimodal Large Language Models (MLLMs) predicted whether an answer is correct, incorrect, or partially correct with an accuracy of 55%, and whether the image provided in the student's answer is relevant with an accuracy of 75%. According to human experts, Pixtral was more aligned with human judgement and values for biology, and ChatGPT for physics and chemistry; these models achieved a score of 4 or more out of 5 on most parameters.
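The two accuracy figures quoted in the abstract correspond to two separate classification tasks: a three-way Level of Correctness label and a binary Image Relevance label. The following is a minimal sketch of how such accuracies could be computed. The `Prediction` schema, the label spellings, and the toy data are assumptions made for illustration and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass

# Label sets follow the abstract; the exact string spellings are assumptions.
LC_LABELS = ("correct", "partially correct", "incorrect")
IR_LABELS = ("relevant", "irrelevant")


@dataclass
class Prediction:
    """Gold and predicted labels for one student answer (hypothetical schema)."""
    lc_gold: str
    lc_pred: str
    ir_gold: str
    ir_pred: str


def accuracy(pairs):
    """Fraction of (gold, predicted) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to score")
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


def score(preds):
    """Return (LC accuracy, IR accuracy) over a list of Prediction objects."""
    lc = accuracy((p.lc_gold, p.lc_pred) for p in preds)
    ir = accuracy((p.ir_gold, p.ir_pred) for p in preds)
    return lc, ir


# Toy data invented purely for illustration (not from the MMSAF dataset).
toy = [
    Prediction("correct", "correct", "relevant", "relevant"),
    Prediction("incorrect", "partially correct", "relevant", "relevant"),
    Prediction("partially correct", "partially correct", "irrelevant", "relevant"),
    Prediction("incorrect", "incorrect", "irrelevant", "irrelevant"),
]
lc_acc, ir_acc = score(toy)
print(f"LC accuracy: {lc_acc:.2f}, IR accuracy: {ir_acc:.2f}")
```

Scoring the two tasks separately mirrors how the abstract reports them; a model can be strong on one and weak on the other.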

Paper Structure (28 sections, 1 equation, 12 figures, 7 tables)

Introduction
Related Work
The Multimodal Short Answer Grading with Feedback (MMSAF) Problem
Classification of Level of Correctness and Image Relevance
Feedback Generation
Multimodal Short Answer Grading with Feedback (MMSAF) Dataset
Generation of Textual and Image Segments of Student Answers
Generation of Level of Correctness, Image Relevance and Rubrics
LLMs in Consideration
Evaluation of LLM Generated Feedback
Analysis of Correctness and Relevance levels
Evaluation Task for Experts
Analysis of Expert Evaluation
Conclusion
Prompt used for synthetically generating correct responses
...and 13 more sections

Figures (12)

Figure 1: Illustration of the MMSAF problem with an example. (Image source for heart diagram: https://edurev.in/t/131714/STRUCTURE-OF-HUMAN-HEART)
Figure 2: An automatic framework to generate the MMSAF dataset
Figure 3: Confusion Matrix for Gemini after True Class Normalization
Figure 4: Confusion Matrix for ChatGPT after True Class Normalization
Figure 5: Confusion Matrix for Pixtral after True Class Normalization
...and 7 more figures
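
Several of the figures listed above are confusion matrices "after True Class Normalization". A minimal sketch of one common reading of that phrase follows: each row (one true class) is divided by its row total, so every row sums to 1 and the diagonal gives per-class recall. This interpretation, the row/column orientation, the label spellings, and the toy labels are all assumptions for illustration, not details confirmed by the paper.

```python
from collections import Counter

# Label spellings are assumptions; rows are true classes, columns are predictions.
LABELS = ("correct", "partially correct", "incorrect")


def confusion_matrix(gold, pred, labels=LABELS):
    """Count matrix with one row per true label and one column per predicted label."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


def normalize_by_true_class(matrix):
    """Divide each row by its total; rows with no examples stay all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy labels invented for illustration only.
gold = ["correct", "correct", "partially correct",
        "incorrect", "incorrect", "incorrect", "incorrect"]
pred = ["correct", "partially correct", "partially correct",
        "incorrect", "incorrect", "correct", "incorrect"]

cm = confusion_matrix(gold, pred)
norm = normalize_by_true_class(cm)
for label, row in zip(LABELS, norm):
    print(label, [round(v, 2) for v in row])
```

Normalizing by true class makes rows comparable when the classes are imbalanced, which raw counts would obscure.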