ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
Cheng Qian, Hongyi Du, Hongru Wang, Xiusi Chen, Yuji Zhang, Avirup Sil, Chengxiang Zhai, Kathleen McKeown, Heng Ji
TL;DR
ModelingBench provides real-world, open-ended math modeling problems drawn from COMAP contests, challenging LLMs to turn natural-language problem statements into formal mathematical formulations, data-driven analyses, and defensible reports. ModelingAgent, a four-agent system with a shared memory and a Critic, enables iterative self-improvement and coordinated tool use to tackle these tasks, while ModelingJudge offers expert-in-the-loop evaluation that mirrors real-world judging practices. Across extensive experiments, ModelingAgent outperforms vanilla and tool-based baselines by up to 20% and approaches human performance on several metrics, though genuine innovation and robust data-grounded reasoning remain challenging. The work argues for a practical, interpretable, and extensible framework to evaluate and advance real-world problem-solving in interdisciplinary modeling, with potential extensions to multi-modal inputs and stronger human-in-the-loop integration.
Abstract
Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect the complexity of real-world problems, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also present ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. To evaluate outputs, we propose ModelingJudge, an expert-in-the-loop system that uses LLMs as domain-specialized judges to assess solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges.
