Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation
Jaewoo Park, Jungyang Park, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu
TL;DR
This paper introduces ME2, a benchmark for multimodal solution explanations in math education, requiring models to identify visual keypoints and generate explanations anchored to those keypoints. ME2 comprises 1,000 problem–solution instances with problem/solution images, visual keypoints, and a concise explanatory direction, organized into two tasks: Visual Keypoint Identification and Keypoint-based Explanation Generation. Comprehensive experiments across generalist, math-specialized, and proprietary models reveal large gaps in visual grounding and the ability to produce visually grounded educational explanations, with proprietary models performing best but still far from solving the challenge. The work demonstrates the need for improved mathematically grounded visual understanding in AI tutors and offers a benchmark to catalyze progress in educational multimodal AI.
Abstract
With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students' comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these elements, which are essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that current models struggle to identify visual keypoints. In the task of generating keypoint-based explanations, open-source models also face notable difficulties. This highlights a significant gap in current LLMs' ability to perform mathematical visual grounding, engage in visually grounded reasoning, and provide explanations in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.
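To make the annotation described above more concrete, the following is a minimal sketch of how one benchmark instance might be represented and how a generated explanation could be checked for references to annotated keypoints. Everything here is an assumption for illustration: the class name `ME2Instance`, its field names, and the helpers `mentioned_keypoints` and `keypoint_coverage` are hypothetical, and the substring-based check is a simple stand-in, not the evaluation protocol used in the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ME2Instance:
    """Hypothetical record layout; field names are illustrative, not the released schema."""
    problem_text: str
    problem_image: str    # path to the problem figure
    solution_image: str   # path to the annotated solution figure
    visual_keypoints: List[str] = field(default_factory=list)
    explanation: str = ""  # reference explanation that mentions the keypoints


def mentioned_keypoints(instance: ME2Instance, generated: str) -> List[str]:
    """Return annotated keypoints that appear (case-insensitively) in a generated explanation."""
    text = generated.lower()
    return [kp for kp in instance.visual_keypoints if kp.lower() in text]


def keypoint_coverage(instance: ME2Instance, generated: str) -> float:
    """Fraction of annotated keypoints mentioned; 0.0 when no keypoints are annotated."""
    if not instance.visual_keypoints:
        return 0.0
    return len(mentioned_keypoints(instance, generated)) / len(instance.visual_keypoints)


if __name__ == "__main__":
    # Made-up example instance, not taken from the dataset.
    example = ME2Instance(
        problem_text="Find the area of triangle ABC.",
        problem_image="problem.png",
        solution_image="solution.png",
        visual_keypoints=["auxiliary line AD", "angle BAC"],
    )
    print(keypoint_coverage(example, "Draw auxiliary line AD to split the triangle."))
```

A plain substring match is deliberately crude: it would miss paraphrases such as "the segment from A to D", so a real evaluation would likely need a more tolerant matching scheme.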
