Constructing Industrial-Scale Optimization Modeling Benchmark
Zhong Li, Hongliang Lu, Tao Wei, Wenyu Liu, Yuxuan Chen, Yuan Lan, Fan Zhang, Zaiwen Wen
TL;DR
This work tackles the gap between industrial-scale optimization and natural-language-to-optimization (NL-to-Opt) evaluation by constructing MIPLIB-NL, a large-scale NL-to-Opt benchmark derived from real MILPs in MIPLIB 2017 using a structure-aware reverse-generation pipeline. The pipeline recovers loop-based constraint scaffolds, enforces a model–data separation format, and validates semantic fidelity through independent NL-to-Opt reconstructions and expert review. Comprising 223 one-to-one reconstructions, MIPLIB-NL preserves the mathematical content of the original instances while enabling compact NL descriptions suitable for industrial-scale models with up to $10^{7}$ variables and constraints. Experiments show that state-of-the-art NL-to-Opt systems, which perform well on toy benchmarks, degrade substantially on MIPLIB-NL, revealing failure modes tied to indexing, constraint interaction, and data binding that are invisible at smaller scales. This data-centric benchmark design provides a principled foundation for advancing NL-to-Opt research toward realistic, industrially relevant capabilities.
Abstract
Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$ to $10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations and solver code grounded in real optimization models. To fill this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB 2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model–data separation format, and (iii) performs iterative semantic validation through expert review and human–LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.
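To make the idea of model–data separation concrete, the following minimal sketch shows a loop-based constraint scaffold that is written once and expanded into flat constraint rows by a separate data dictionary. It is an illustration only: the assignment problem, the function name `build_assignment_model`, and the dictionary layout are hypothetical and are not the MIPLIB-NL format or any instance from MIPLIB 2017.

```python
from itertools import product


def build_assignment_model(data):
    """Expand a loop-based scaffold into flat linear constraints.

    Hypothetical illustration: each constraint is (coeffs, sense, rhs),
    where coeffs maps a variable name to its coefficient.
    """
    workers, tasks = data["workers"], data["tasks"]
    cost = data["cost"]

    # Binary variables x[w,t]; the objective is the total assignment cost.
    variables = [f"x[{w},{t}]" for w, t in product(workers, tasks)]
    objective = {f"x[{w},{t}]": cost[w][t] for w, t in product(workers, tasks)}

    constraints = []
    # "Each task is done by exactly one worker": one row per task.
    for t in tasks:
        constraints.append(({f"x[{w},{t}]": 1 for w in workers}, "==", 1))
    # "Each worker handles at most capacity[w] tasks": one row per worker.
    for w in workers:
        row = {f"x[{w},{t}]": 1 for t in tasks}
        constraints.append((row, "<=", data["capacity"][w]))
    return variables, objective, constraints


# The data lives apart from the model; swapping in a larger dictionary
# grows the flat formulation without changing the scaffold above.
small = {
    "workers": ["a", "b"],
    "tasks": ["t1", "t2", "t3"],
    "cost": {"a": {"t1": 4, "t2": 2, "t3": 5}, "b": {"t1": 3, "t2": 6, "t3": 1}},
    "capacity": {"a": 2, "b": 2},
}
variables, objective, constraints = build_assignment_model(small)
print(len(variables), len(constraints))  # 6 variables, 5 constraint rows
```

The two `for` loops stay the same size regardless of how many workers and tasks the data contains, which is why a description tied to recovered structure can remain compact even when the flat formulation has millions of rows.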
