mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models

Arka Mukherjee, Shreya Ghosh

TL;DR

mmJEE-Eval introduces a bilingual, multimodal benchmark derived from seven years of JEE Advanced to scrutinize scientific reasoning in vision-language models. By combining English and Hindi prompts with image-enabled questions and a rigorous multi-source ground-truth process, it reveals pronounced gaps between open-source and closed-frontier models, particularly in cross-lingual consistency and metacognition. The framework includes contamination tests, detailed ablations, and a human-in-the-loop error analysis, demonstrating that accuracy alone underestimates reasoning quality and self-correction ability. The work provides a valuable, domain-specific diagnostic tool for evaluating and advancing robust, multilingual scientific reasoning in VLMs with practical implications for education-oriented AI systems.

Abstract

Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU, MathVista). Yet these results do not sufficiently distinguish genuine scientific reasoning and articulation capabilities from pattern-matching. To address this gap, we introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025), spanning pre-college Physics, Chemistry, and Mathematics. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. Although closed frontier models from Google and OpenAI reach high problem-solving accuracy (up to 100% pass@3), they collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2% of errors). Systematic ablations show that mmJEE-Eval's difficulty stems from complexity and reasoning depth rather than memorization. In effect, our benchmark separates superior training and reasoning methodologies where alternative benchmarks fail to do so. We publicly release our code and data: https://mmjee-eval.github.io
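The abstract reports pass@3 scores without defining the metric. As a rough illustration only, the sketch below uses the common empirical reading of pass@k: a question counts as solved if at least one of the first k sampled answers is correct. The function name `pass_at_k` and the toy `results` data are hypothetical, and the paper's own scoring script may differ.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of questions with at least one correct answer in the first k attempts.

    `attempts[i][j]` is True when the j-th sampled answer to question i was correct.
    This is an assumed, simplified definition, not the paper's confirmed procedure.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("need at least one question")
    solved = sum(1 for per_question in attempts if any(per_question[:k]))
    return solved / len(attempts)


# Toy data (hypothetical): 4 questions, 3 sampled answers each.
results = [
    [False, True, False],
    [True, True, True],
    [False, False, False],
    [False, False, True],
]

print(pass_at_k(results, 1))  # 0.25
print(pass_at_k(results, 3))  # 0.75
```

Under this reading, pass@3 can only match or exceed single-attempt accuracy, which is why it is reported separately from the 77-84% accuracy figures.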

Paper Structure

This paper contains 29 sections, 24 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Example problem and response from mmJEE-Eval. Despite mathematical correctness, the model incorrectly assumes uniform thickness ("same physical thickness $t$ of glass") when the figure clearly shows wedge-shaped glass pieces with varying thickness. This multimodal reasoning failure demonstrates the multiple dimensions our proposed benchmark tests.
  • Figure 2: Heatmap comparing accuracy across languages (English, Hindi) for four language models. Gemini 2.5 Pro achieves consistently high performance (>0.80), while smaller models show uniformly low performance (around 0.10). Color scale represents accuracy from 0.1 (light blue) to 0.8+ (dark red).
  • Figure 3: Thresholding analysis comparing GPT-5 and InternVL3-78B performance. (a) shows the thresholding behavior for GPT-5, while (b) shows the corresponding analysis for InternVL3-78B. The OpenAI model behaves more stably than InternVL3-78B, indicating that it generated correct answers more consistently across 10 runs.
  • Figure 4: Accuracy characteristics of two open-source models—Gemma3 27B and Granite Vision 3.2 2B—on mmJEE-Eval. Gemma3 27B (left) shows consistent performance with minimal variance around 29.63% mean accuracy. Granite Vision 3.2 2B (right) exhibits erratic behavior resembling random guessing, with accuracy fluctuations between 5.88% and 29.41%, making it unsuitable for reliable evaluation.
  • Figure 5: Analysis of confidence interval convergence across multiple experimental runs. The top panel shows how the 95% confidence interval width decreases as the number of runs increases, demonstrating improved statistical precision. The bottom panel displays the mean accuracy with confidence bounds, showing stabilization of the estimate around 29.62% after sufficient runs.
  • ...and 5 more figures
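
The Figure 5 caption describes the 95% confidence interval narrowing as runs accumulate. A minimal sketch of that calculation follows, assuming a normal approximation (z = 1.96) over per-run accuracies. The paper's exact interval method is not stated in this excerpt, the function names are hypothetical, and the accuracy values are made-up illustrations rather than reported results. With only a handful of runs, a Student's t quantile would give a somewhat wider interval.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def ci95_half_width(accuracies: Sequence[float]) -> float:
    """Half-width of a normal-approximation 95% CI for the mean accuracy."""
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(accuracies)
    return 1.96 * sd / math.sqrt(n)


def ci_width_by_runs(accuracies: Sequence[float]) -> List[Tuple[int, float]]:
    """Full CI width after the first 2, 3, ..., N runs."""
    return [
        (n, 2.0 * ci95_half_width(accuracies[:n]))
        for n in range(2, len(accuracies) + 1)
    ]


# Hypothetical per-run accuracies for one model (not taken from the paper).
runs = [0.28, 0.31, 0.30, 0.29, 0.30, 0.30, 0.29, 0.31]

for n, width in ci_width_by_runs(runs):
    mean = statistics.mean(runs[:n])
    print(f"runs={n}  mean={mean:.4f}  ci_width={width:.4f}")
```

Because the half-width scales as 1/sqrt(n) for a fixed spread, quadrupling the number of runs roughly halves the interval, which matches the flattening trend the caption describes.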