How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

Yuhang Liu, Heyan Huang, Yizhe Yang, Hongyan Zhao, Zhizhuo Zeng, Yang Gao

Abstract

Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework's reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.

Figures (6)

  • Figure 1: Problem-oriented, stage-wise evaluation framework. An LLM first decomposes a mathematical modeling problem into a set of subtasks, which are guided and refined by domain experts to preserve the original intent. For each verified subtask, a stage-aware evaluation rubric is instantiated by conditioning on the major stages of the modeling process. For each subtask--stage pair, the LLM generates concrete evaluation criteria under expert guidance, which serve as the atomic units for scoring and are further verified and revised. The resulting subtask--stage rubric defines a fixed, problem-specific evaluation structure applied uniformly across models. (An illustrative data-model sketch of this rubric structure follows the figure list.)
  • Figure 2: Stage-wise agreement between automatic and human expert scores measured by ICC(2,1). Higher values indicate stronger alignment with expert assessment. The boxplots correspond to individual modeling stages (Prb Idf, Prb Frm, Asm Dev, Mod Con, Mod Sol, Cod Imp, Res Ays), following the standard mathematical modeling workflow. (An illustrative ICC(2,1) computation is sketched after the figure list.)
  • Figure 3: Distribution of evaluation scores under the baseline rubric and our framework. Base Avg denotes the report-level baseline score, and Our Avg denotes the overall score under our problem-oriented, stage-aware framework. The remaining boxplots correspond to individual modeling stages (Prb Idf, Prb Frm, Asm Dev, Mod Con, Mod Sol, Cod Imp, Res Ays).
  • Figure 4: Stage-wise performance of four LLMs across the mathematical modeling pipeline. DS-Inst and DS-Think denote two DeepSeek-V3.2 variants, and Qw3-235B denotes a Qwen3-235B-Instruct model. Scores are averaged over reports and subtasks for each stage.
  • Figure 5: Stage-wise performance under within-family scaling for Qwen models. Bars correspond to Qw-7B--Qw-72B (Qwen2.5) and Qw-235B (Qwen3-235B).
  • ...and 1 more figure
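
The Figure 1 pipeline treats expert-verified criteria, organized by subtask and modeling stage, as the atomic units of scoring. The following is a minimal, hypothetical sketch of such a subtask--stage rubric structure in Python; the class names, fields, and helper function are illustrative assumptions, not artifacts released with the paper.

```python
from dataclasses import dataclass, field

# Stage labels as abbreviated in Figures 2-5 of the paper.
STAGES = ["Prb Idf", "Prb Frm", "Asm Dev", "Mod Con", "Mod Sol", "Cod Imp", "Res Ays"]

@dataclass
class Criterion:
    """Atomic scoring unit: one expert-verified criterion for a subtask--stage pair."""
    description: str
    max_score: float = 1.0

@dataclass
class SubtaskRubric:
    """Stage-aware rubric for one expert-verified subtask of a modeling problem."""
    subtask: str
    criteria_by_stage: dict[str, list[Criterion]] = field(default_factory=dict)

def stage_score(rubric: SubtaskRubric, stage: str, awarded: list[float]) -> float:
    """Normalize the points awarded for one stage of one subtask to [0, 1]."""
    total = sum(c.max_score for c in rubric.criteria_by_stage.get(stage, []))
    return sum(awarded) / total if total else 0.0
```

Grouping criteria per subtask--stage pair makes the stage-wise averages reported in Figures 4-5 a straightforward aggregation over these units, applied identically to every model's report.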
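Figure 2 measures agreement between automatic and human expert scores with ICC(2,1), the two-way random-effects, single-rater, absolute-agreement intraclass correlation of Shrout and Fleiss. As a minimal sketch under that standard definition, the function and example scores below are hypothetical and not taken from the paper's data.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, single rater, absolute agreement.
    `ratings` is an (n_targets, k_raters) matrix of scores."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-target means
    col_means = x.mean(axis=0)   # per-rater means

    # Mean squares for targets (rows), raters (columns), and residual error.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)
    sse = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical example: automatic vs. human expert scores on the same
# criteria for one modeling stage.
auto_scores  = [4.0, 3.5, 2.0, 4.5, 3.0]
human_scores = [4.0, 3.0, 2.5, 4.0, 3.5]
print(icc2_1(np.column_stack([auto_scores, human_scores])))
```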