Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education

Mohamed Huti, Alasdair Mackintosh, Amy Waldock, Dominic Andrews, Maxime Lelièvre, Moritz Boos, Tobias Murray, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod

TL;DR

The paper addresses a gap in the visual reasoning capabilities of AI systems for classroom tasks by introducing the Visual Reasoning Benchmark (VRB), a dataset of 701 minimal-text, classroom-authentic visual questions sourced from primary school exams in two low- and middle-income countries (LMICs), Zambia and India. It evaluates 45 multimodal systems, analyses performance across task formats and visual skills, and reveals a jagged frontier: static skills such as counting and scaling are easier, while dynamic spatial transformations (folding, rotation, reflection) remain challenging, a difficulty often exacerbated by artefacts. Key contributions are the identification of a spatial ceiling and of the need for human oversight in deployment, along with a detailed methodology for dataset curation, annotation, and evaluation that can guide safer, more effective educational AI tools. The work emphasises practical impact for classroom use in LMIC settings and provides a framework, prompts, and code to reproduce and extend the benchmark, while advocating process-aware evaluation and explanations to improve reasoning fidelity.

Abstract

AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck, particularly in early-grade maths, which relies heavily on visuals. This paper introduces the Visual Reasoning Benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. The benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark, which intentionally uses unedited, minimal-text images to test whether models can meet the realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.

Paper Structure (26 sections, 12 figures)

Figures (12)

  • Figure 1: Question examples showing the six task categories in our Visual Reasoning Benchmark.
  • Figure 2: Question examples showing the ten skill tags in our Visual Reasoning Benchmark.
  • Figure 3: Question examples showing the two types of error in our Visual Reasoning Benchmark.
  • Figure 4: Accuracy on VRB for a subset of models. Errors show 95% bootstrap confidence intervals.
  • Figure 5: Accuracy on the Visual Reasoning Benchmark (VRB) by Weights Availability.
  • ...and 7 more figures
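
The caption of Figure 4 mentions 95% bootstrap confidence intervals on accuracy. The paper's exact procedure is not given here, so the following is only a minimal sketch of one common variant, the percentile bootstrap over per-question correctness. The function name `bootstrap_accuracy_ci`, the resample count, the seed, and the example outcomes are all hypothetical and are not taken from the paper.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for accuracy over per-question 0/1 outcomes.

    Assumption: questions are resampled independently with replacement;
    the paper may use a different bootstrap scheme.
    """
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(correct)
    # Resample the question set with replacement and record each accuracy.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * (n_resamples - 1))]
    upper = means[int((1.0 - alpha) * (n_resamples - 1))]
    return sum(correct) / n, lower, upper


# Hypothetical outcomes for 701 questions (illustrative, not paper results).
outcomes = [1] * 420 + [0] * 281
acc, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={acc:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With only 701 questions, intervals of this kind can be several percentage points wide, which matters when comparing models whose accuracies are close.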