ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling

Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Yizhou Han, Yufeng Lin, Zhihang Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun, Tian Ding

TL;DR

ORGEval addresses the challenge of evaluating LLMs in optimization modeling, where solver-based metrics are unreliable. It represents optimization problems as graphs and reduces equivalence checking to graph isomorphism testing, using a tailored WL test that is guaranteed to be correct when the graphs satisfy a symmetric decomposable (SD) condition. The paper formalizes model equivalence, introduces the Bench4Opt dataset (with model-data separation), and reports that ORGEval achieves 100% consistency across random parameter configurations and substantial runtime gains over solver-based methods. In the LLM benchmark, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracy under direct prompting. The results provide a principled, scalable framework for assessing LLMs in optimization modeling, with practical impact for industrial problem formulation.

Abstract

Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult due to the absence of robust metrics. Existing solver-based approaches often face inconsistency, infeasibility issues, and high computational costs. To address these issues, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs' capabilities in formulating linear and mixed-integer linear programs. ORGEval represents optimization models as graphs, reducing equivalence detection to graph isomorphism testing. We identify and prove a sufficient condition under which the Weisfeiler-Lehman (WL) test is guaranteed to correctly detect isomorphism: the tested graphs must be symmetric decomposable (SD). Building on this, ORGEval integrates a tailored variant of the WL-test with an SD detection algorithm to evaluate model equivalence. By focusing on structural equivalence rather than instance-level configurations, ORGEval is robust to numerical variations. Experimental results show that our method can successfully detect model equivalence and produce 100% consistent results across random parameter configurations, while significantly outperforming solver-based methods in runtime, especially on difficult problems. Leveraging ORGEval, we construct the Bench4Opt dataset and benchmark state-of-the-art LLMs on optimization modeling. Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting, outperforming even leading reasoning models.
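To make the graph representation concrete, the following is a minimal sketch of turning one LP/MILP instance into a weighted bipartite graph. The abstract does not specify the node features, so everything here is an assumption for illustration: the names `BipartiteInstance` and `to_bipartite` are hypothetical, variable nodes carry (objective coefficient, integrality flag), constraint nodes carry (sense, right-hand side), and each nonzero constraint coefficient becomes a weighted edge. The objective direction is not encoded in this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BipartiteInstance:
    """Hypothetical container for one instance graph (not the paper's API)."""
    var_feats: tuple  # one (objective coefficient, is_integer) per variable node
    con_feats: tuple  # one (sense, right-hand side) per constraint node
    edges: tuple      # (constraint index, variable index, coefficient)


def to_bipartite(c, is_integer, A, senses, b):
    """Build the bipartite graph of an instance with constraints A x (sense) b."""
    var_feats = tuple((float(cj), bool(ij)) for cj, ij in zip(c, is_integer))
    con_feats = tuple((s, float(bi)) for s, bi in zip(senses, b))
    # Only nonzero coefficients produce an edge between constraint i and variable j.
    edges = tuple(
        (i, j, float(a))
        for i, row in enumerate(A)
        for j, a in enumerate(row)
        if a != 0
    )
    return BipartiteInstance(var_feats, con_feats, edges)


# Objective 3x + 2y, subject to  x + y <= 4  and  x + 3y <= 6.
instance = to_bipartite(
    c=[3, 2],
    is_integer=[False, False],
    A=[[1, 1], [1, 3]],
    senses=["<=", "<="],
    b=[4, 6],
)
```

In this encoding, reordering variables or constraints changes only the node indices, not the graph structure, which is why equivalence can be posed as an isomorphism question rather than a comparison of optimal objective values.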

Paper Structure

This paper contains 47 sections, 10 theorems, 30 equations, 12 figures, 4 tables, 3 algorithms.

Key Result

Theorem 3.1

Suppose $\mathcal{P}_{1}$ and $\mathcal{P}_{2}$ are symmetric decomposable. Then $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ share the same coloring distribution after WL-test coloring $\Longleftrightarrow \mathcal{P}_{1}\sim \mathcal{P}_{2}$.
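The comparison of coloring distributions in the theorem can be illustrated with a small sketch. This is plain WL color refinement on weighted bipartite graphs, not the paper's tailored variant, and it does not check the symmetric decomposable condition. The function name `wl_color_histograms` and the tuple layout `(var_feats, con_feats, edges)` are assumptions made for this example.

```python
from collections import Counter


def wl_color_histograms(g1, g2):
    """Run WL color refinement on two graphs at once and return their color histograms.

    Each graph is (var_feats, con_feats, edges), with edges given as
    (constraint index, variable index, weight). This layout is an assumption.
    """
    graphs = (g1, g2)

    def relabel(sigs_per_graph):
        # One shared table per round, so equal signatures get equal colors in both graphs.
        table = {}
        return [[table.setdefault(s, len(table)) for s in sigs] for sigs in sigs_per_graph]

    # Initial colors come from node features; variables and constraints never share a color.
    colors = relabel(
        [[("v", f) for f in vf] + [("c", f) for f in cf] for vf, cf, _ in graphs]
    )
    rounds = max(len(vf) + len(cf) for vf, cf, _ in graphs)
    for _ in range(rounds):
        sigs_per_graph = []
        for (vf, cf, edges), col in zip(graphs, colors):
            nbrs = [[] for _ in col]
            for i, j, w in edges:
                ci = len(vf) + i  # constraint nodes are stored after variable nodes
                nbrs[j].append((w, col[ci]))
                nbrs[ci].append((w, col[j]))
            # New signature: own color plus the sorted multiset of (weight, neighbor color).
            sigs_per_graph.append([(c, tuple(sorted(n))) for c, n in zip(col, nbrs)])
        colors = relabel(sigs_per_graph)
    return Counter(colors[0]), Counter(colors[1])


# Objective 3x + 2y with constraints x + y <= 4 and x + 3y <= 6.
g1 = (((3.0, False), (2.0, False)), (("<=", 4.0), ("<=", 6.0)),
      ((0, 0, 1.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 3.0)))
# Same instance with variables and constraints listed in the opposite order.
g2 = (((2.0, False), (3.0, False)), (("<=", 6.0), ("<=", 4.0)),
      ((0, 0, 3.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 1.0)))
# A different instance: the coefficient 3 in the second constraint becomes 2.
g3 = (((3.0, False), (2.0, False)), (("<=", 4.0), ("<=", 6.0)),
      ((0, 0, 1.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 2.0)))

h1, h2 = wl_color_histograms(g1, g2)
same_after_permutation = h1 == h2
h1b, h3 = wl_color_histograms(g1, g3)
same_after_coefficient_change = h1b == h3
```

Here the permuted instance yields matching histograms, while the instance with the changed coefficient does not. In general, matching histograms are only a necessary condition for isomorphism; the theorem states that they are also sufficient when both instances are symmetric decomposable.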

Figures (12)

  • Figure 1: Evaluation Pipeline: Each example in our dataset includes a problem description, a parameter file, and a model instance with parameters applied. To assess an AI system's modeling capability, we evaluate the equivalence between the AI-generated instance and the ground truth instance in our dataset, using a common set of parameters. Each instance pair can be represented by two bipartite graphs, to which we apply an isomorphism testing algorithm while also checking the sufficiency of the algorithm.
  • Figure 2: Evaluation Framework: The ultimate goal of modeling equivalence is to directly assess whether one model can be equivalently transformed to a standard model (top left). Existing work tests the equivalence between numerical instances by comparing their optimal objective (top right). Our evaluation method approximates the ultimate goal of directly evaluating modeling equivalence by randomly sampling instances and testing instance isomorphism (bottom).
  • Figure 3: Transform model instance to a bipartite graph.
  • Figure 4: Example of the concise version of the word problem on cargo loading.
  • Figure 5: Example of the word problem on cargo loading.
  • ...and 7 more figures

Theorems & Definitions (31)

  • Definition 1: Modeling Problem Instance
  • Definition 2: MILP/LP Model
  • Definition 3: Model-lossless-reduction
  • Definition 4: Execution Accuracy
  • Definition 5: Model Isomorphism
  • Definition 6: Instance-Level Isomorphism
  • Definition 7: Weighted Bipartite Graph Instance Representation
  • Definition 3.1: Symmetric Decomposable Instance
  • Theorem 3.1
  • Definition C.1: Model Equivalence
  • ...and 21 more