Constructing Industrial-Scale Optimization Modeling Benchmark

Zhong Li; Hongliang Lu; Tao Wei; Wenyu Liu; Yuxuan Chen; Yuan Lan; Fan Zhang; Zaiwen Wen

TL;DR

This work tackles the gap between industrial-scale optimization and NL-to-Opt evaluation by constructing MIPLIB-NL, a large-scale NL-to-Opt benchmark derived from real MILPs in MIPLIB 2017 using a structure-aware reverse-generation pipeline. It recovers loop-based constraint scaffolds, enforces a model–data separation format, and validates semantic fidelity through independent NL-to-Opt reconstructions and expert review. Comprising 223 one-to-one reconstructions, MIPLIB-NL preserves the mathematical content of the original instances while enabling compact NL descriptions suitable for industrial-scale models up to $10^{7}$ variables and constraints. Experiments show that state-of-the-art NL-to-Opt systems, which perform well on toy benchmarks, experience substantial performance degradation on MIPLIB-NL, revealing failure modes tied to indexing, constraint interaction, and data binding that are invisible at smaller scales. This data-centric benchmark design provides a principled foundation for advancing NL-to-Opt research toward realistic, industrially relevant capabilities.

Abstract

Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$--$10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations and solver code grounded in real optimization models. To fill this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB 2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model--data separation format, and (iii) performs iterative semantic validation through expert review and human--LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.
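
To make the idea of model--data separation concrete, the following minimal sketch shows how a compact, loop-based description can expand into a flat list of constraint rows once data is bound to it. It is an illustration only: the toy assignment problem, the `data` dictionary layout, and the `build_rows` helper are assumptions made for this example and are not the format or code used by MIPLIB-NL.

```python
from itertools import product

# Hypothetical "data file": index sets for a toy assignment problem.
# The layout of this dictionary is an assumption for illustration.
data = {
    "workers": ["w1", "w2", "w3"],
    "tasks": ["t1", "t2"],
}


def build_rows(data):
    """Expand two loop-based constraint families into flat rows.

    Each row is (coefficients, sense, right-hand side), which mimics
    the flattened, row-by-row view a solver file would contain.
    """
    workers, tasks = data["workers"], data["tasks"]

    # One binary variable per (worker, task) pair.
    variables = [f"x[{w},{t}]" for w, t in product(workers, tasks)]

    rows = []
    # Family 1: every task is assigned to exactly one worker.
    for t in tasks:
        rows.append(({f"x[{w},{t}]": 1 for w in workers}, "==", 1))
    # Family 2: every worker takes at most one task.
    for w in workers:
        rows.append(({f"x[{w},{t}]": 1 for t in tasks}, "<=", 1))
    return variables, rows


variables, rows = build_rows(data)
print(len(variables), "variables,", len(rows), "constraint rows")
```

The model logic stays two loops long regardless of how many workers and tasks the data lists, while the flat row count grows with the index sets. This is the sense in which a short description can stand for a very large instance.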

Paper Structure (68 sections, 18 equations, 16 figures, 9 tables)

Introduction
Preliminary
LLMs for Optimization Modeling
Existing NL-to-Optimization Benchmarks
Problem Setting and a Data-Centric Perspective
Constructing Industrial-Scale Benchmark via Structure-Aware Reverse Generation
Overview of the Construction Pipeline
Structural Abstraction from MPS Math Models
Structure-Preserving NL Reverse Generation
Semantic Validation
The MIPLIB-NL Benchmark: Properties and Characteristics
Experiments
Evaluation Overview
Overall Performance across Datasets (RQ1)
Scale-Dependent Performance (RQ2)
...and 53 more sections

Figures (16)

Figure 1: Accuracy distributions of state-of-the-art LLM-based methods for optimization modeling across existing benchmarks and MIPLIB-NL (ours). Each boxplot shows the distribution of accuracies across methods for a given benchmark, with individual markers indicating method-level performance. MIPLIB-NL remains substantially more challenging, with generally lower performance across methods.
Figure 2: Distributions of problem sizes across optimization benchmarks, measured as the total number of variables plus constraints per instance. Each violin represents the per-instance size distribution of a dataset. A shared color scale encodes dataset scale, where warmer colors indicate larger maximum instance sizes. Compared to prior benchmarks, MIPLIB-NL (ours) spans a markedly larger scale and exhibits a substantially heavier upper tail, reflecting the presence of many large-scale industrial instances. While the violin shape is compressed near the lower range due to the extreme scale variation, a bucketed analysis in Fig. 5 further reveals that a significant fraction of MIPLIB-NL instances lie well beyond the scale of existing benchmarks.
Figure 3: Structure-aware reverse construction pipeline: starting from MIPLIB 2017 math models, we perform (1) expert-driven structural abstraction of (often large) MPS formulations, (2) structure-preserving Opt-to-NL reverse generation via expert-designed blueprints, and (3) semantic validation via independent NL-to-Opt reconstruction with human--LLM interaction.
Figure 4: Expert-driven structural understanding and loop scaffold construction. To recover high-level modeling logic from industrial MPS files (Stage 1), we adopt an expert-in-the-loop process. OR experts combine on-model analysis (variable naming patterns, constraint blocks, coefficient regularities, sparsity and incidence structure) with off-model evidence (e.g., MIPLIB metadata, source papers, and domain hints if available) to form hypotheses about problem families, variable roles, and constraint semantics. These hypotheses are refined through inspection of decomposed model structure and validated against algebraic rows in the MPS file, yielding indexed variable groups, constraint families (loops), and atomic blocks. The resulting loop-based scaffold serves as the foundation for deterministic NL blueprint construction and structure-preserving Opt-to-NL generation (Stage 2).
Figure 5: Distribution of model size and compression ratio across scale buckets on MIPLIB-NL. The top row shows the distribution of problem size, measured as the total number of variables and constraints. The bottom row reports the corresponding compression ratios, defined as $\mathrm{CR} = \frac{\text{MPS file size}}{\text{NL file size} + \text{auxiliary data file size}}$, which capture how much more compact the structured natural-language description with model--data separation is compared to the flattened, solver-ready MPS representation. Buckets are defined by model size, and annotations indicate the number and proportion of instances in each bucket. The results reveal that, even as instance scale grows substantially, the separated NL--data representation remains markedly more compact, highlighting the importance of scale-aware evaluation for large optimization problems.
...and 11 more figures
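
The two quantities used in the captions of Figures 2 and 5, problem size and compression ratio, can be computed as in the short sketch below. The function names and the byte counts in the usage lines are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from the benchmark.

```python
def model_size(num_variables: int, num_constraints: int) -> int:
    """Problem size as used in the captions: variables plus constraints."""
    return num_variables + num_constraints


def compression_ratio(mps_bytes: int, nl_bytes: int, data_bytes: int) -> float:
    """CR = MPS file size / (NL file size + auxiliary data file size)."""
    denominator = nl_bytes + data_bytes
    if denominator <= 0:
        raise ValueError("NL and data files must have a positive total size")
    return mps_bytes / denominator


# Illustrative numbers only (not taken from MIPLIB-NL).
size = model_size(num_variables=1200, num_constraints=800)
cr = compression_ratio(mps_bytes=1_000_000, nl_bytes=4_000, data_bytes=16_000)
print(size, cr)
```

A ratio above 1 means the natural-language description together with its data file is smaller than the flattened MPS file.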
