VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Hanoona Rasheed; Abdelrahman Shaker; Anqi Tang; Muhammad Maaz; Ming-Hsuan Yang; Salman Khan; Fahad Shahbaz Khan

TL;DR

VideoMathQA introduces a multimodal, temporally extended benchmark for mathematical reasoning in real-world instructional videos. The dataset comprises 420 video–question pairs across 10 domains with 2,945 expert-annotated reasoning steps, capturing direct problem solving, concept transfer, and deep instructional comprehension. It provides high-resolution video, aligned subtitles, and audio, plus four rigorous evaluation strategies (MCQ, MBin, CoT, and step-wise evaluation) to diagnose intermediate reasoning and final answers. Findings show that success hinges on long-range cross-modal grounding, with model size, architecture, and reasoning capabilities influencing performance, and subtitles and frame-rich inputs providing measurable gains. This work delivers a systematic framework and dataset to push toward genuine temporal, multimodal mathematical reasoning.

Abstract

Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA
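As a rough illustration of how the three reasoning challenges and the per-question step annotations could support fine-grained diagnosis, the sketch below groups multiple-choice accuracy by reasoning category. The record layout, field names, and category labels here are hypothetical and chosen only for illustration; they are not the released dataset schema or the official evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass, field


# Hypothetical record layout (an assumption, not the released schema).
@dataclass
class Item:
    question: str
    options: list
    answer_index: int          # index of the correct option
    reasoning_type: str        # e.g. "problem_solving", "concept_transfer", "deep_comprehension"
    steps: list = field(default_factory=list)  # annotated reasoning steps


def accuracy_by_type(items, predictions):
    """Return multiple-choice accuracy for each reasoning category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for item, pred in zip(items, predictions):
        totals[item.reasoning_type] += 1
        if pred == item.answer_index:
            correct[item.reasoning_type] += 1
    return {t: correct[t] / totals[t] for t in totals}


# Toy items invented for this example only.
items = [
    Item("q1", ["a", "b"], 0, "problem_solving", ["step 1"]),
    Item("q2", ["a", "b"], 1, "problem_solving", ["step 1", "step 2"]),
    Item("q3", ["a", "b"], 1, "concept_transfer", ["step 1"]),
]
scores = accuracy_by_type(items, predictions=[0, 0, 1])
```

Reporting accuracy per category rather than as a single number makes it easier to see whether a model fails mainly on transfer or on long-form comprehension.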
