CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning

Hamed Mahdavi, Pouria Mahdavinia, Alireza Farhadi, Pegah Mohammadipour, Samira Malek, Majid Daliri, Pedram Mohammadipour, Alireza Hashemi, Amir Khasahmadi, Vasant Honavar

TL;DR

This work introduces CombiGraph-Vis, a multimodal benchmark for discrete mathematical reasoning with 1135 problems across 13 domains and 3 formats, including $35\%$ image-tagged items and verified solutions with technique labels. It evaluates LLMs on proof-analysis and grading tasks using a two-pronged data pipeline: 90 Gemini 2.5 Pro-generated solutions graded on a $1$–$4$ scale with error annotations, and MathArena solutions scored on a $0$–$7$ scale, which together enable fine-grained assessment. A two-phase, agentic data-curation workflow addresses data quality and rubric derivation, yielding closer alignment with human grades and better partial-credit handling across formats. The results highlight strong model separations, persistent multimodal reasoning gaps, and distractor susceptibility, while the framework and released resources lay a foundation for robust, rubric-driven multimodal discrete-math evaluation.

Abstract

State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
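The abstract reports agreement with human grades on two different scales (1-4 and 0-7) but does not name the metrics used. As a minimal sketch of how such agreement could be measured, the snippet below computes an exact-match rate and a scale-normalized mean absolute error. The function names, the choice of metrics, and the sample grades are illustrative assumptions, not the authors' evaluation code or data.

```python
from statistics import mean


def normalize(score, lo, hi):
    """Map a score on the closed scale [lo, hi] to [0, 1]."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside the scale [{lo}, {hi}]")
    return (score - lo) / (hi - lo)


def agreement(human, model, lo, hi):
    """Compare model grades with human grades on the same scale.

    Returns the exact-match rate and the mean absolute error after
    normalizing both grade lists to [0, 1], so that results on a 1-4
    scale and a 0-7 scale are comparable. (Hypothetical metrics.)
    """
    if len(human) != len(model) or not human:
        raise ValueError("grade lists must be non-empty and of equal length")
    h = [normalize(s, lo, hi) for s in human]
    m = [normalize(s, lo, hi) for s in model]
    return {
        "exact_match": mean(1.0 if a == b else 0.0 for a, b in zip(human, model)),
        "normalized_mae": mean(abs(a - b) for a, b in zip(h, m)),
    }


# Made-up grades on a 0-7 scale, for illustration only.
human_grades = [7, 0, 3, 7]
model_grades = [7, 1, 5, 7]
result = agreement(human_grades, model_grades, lo=0, hi=7)
print(result)
```

Normalizing before averaging is one way to keep a one-point disagreement on the 1-4 scale from being treated as equivalent to a one-point disagreement on the 0-7 scale; other choices, such as rank correlation or weighted kappa, would emphasize different aspects of partial-credit calibration.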

Paper Structure

This paper contains 34 sections, 3 figures, 2 tables, 3 algorithms.

Figures (3)

  • Figure 2: Illustrative toy example of a context-dependent problem.
  • Figure 3: Per-model accuracy by topic (%). Best score per topic is highlighted in bold within each cell.