Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Jaewoo Park, Jungyang Park, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu

TL;DR

This paper introduces ME2, a benchmark for multimodal solution explanations in math education, requiring models to identify visual keypoints and generate explanations anchored to those keypoints. ME2 comprises 1,000 problem–solution instances with problem/solution images, visual keypoints, and a concise explanatory direction, organized into two tasks: Visual Keypoint Identification and Keypoint-based Explanation Generation. Comprehensive experiments across generalist, math-specialized, and proprietary models reveal large gaps in visual grounding and the ability to produce visually grounded educational explanations, with proprietary models performing best but still far from solving the challenge. The work demonstrates the need for improved mathematically grounded visual understanding in AI tutors and offers a benchmark to catalyze progress in educational multimodal AI.

Abstract

With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students' comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that current models struggle to identify visual keypoints. In the task of generating keypoint-based explanations, open-source models also face notable difficulties. This highlights a significant gap in current LLMs' ability to perform mathematical visual grounding, engage in visually grounded reasoning, and provide explanations in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.

Paper Structure

This paper contains 38 sections, 13 figures, 8 tables.

Figures (13)

  • Figure 1: A student solving a math problem often benefits from visual cues—such as lines, symbols, or highlights—that human instructors use to aid understanding, unlike current AI models that focus solely on textual solutions. To serve as effective educational assistants, machines must go beyond answer generation and emulate human-like explanation strategies by explicitly incorporating and referencing visual elements.
  • Figure 2: An overview of the ME2 benchmark. ME2 consists of multimodal problem–solution pairs curated from real-world educational settings, along with visual keypoints and explanation summaries generated through a human–AI annotation process.
  • Figure 3: We propose two subtasks to robustly analyze multimodal solution explanation capacity: (1) Visual Keypoint Identification, which challenges machines to recognize visual keypoints useful for subsequent explanation, and (2) Keypoint-based Explanation Generation, which requires models to generate explanations that explicitly reference the identified visual keypoints.
  • Figure 4: Coverage of geometry and graph topics across 17 chapters in the ME2 benchmark.
  • Figure 5: Examples of reasoning processes and final predictions produced by Qwen2.5-VL 7B, Math-PUMA, and Gemini 2.0 Flash on the Visual Keypoint Identification task. Qwen2.5-VL demonstrates task understanding and reasoning but produces an incorrect answer, Math-PUMA lacks both, while Gemini 2.0 Flash demonstrates both and produces the correct answer.
  • ...and 8 more figures