Are Your Generated Instances Truly Useful? GenBench-MILP: A Benchmark Suite for MILP Instance Generation
Yidong Luo, Chenguang Wang, Dong Li, Tianshu Yu
TL;DR
GenBench-MILP addresses the gap in MILP instance-generation evaluation by combining solver-independent validity and structural-mimicry metrics with solver-dependent hardness and solver-internal feature analyses. The framework treats solver behavior as an expert assessor, using metrics such as root-node gaps, heuristic success rates, and cutting-plane usage to reveal genuine computational differences that static graph features miss. Through extensive experiments with G2MILP, ACM-MILP, and DIG-MILP, the study shows that high structural similarity does not guarantee similar solver interactions or difficulty, and it demonstrates the utility of solver fingerprints for robust evaluation and cross-solver profiling. The work provides a modular, extensible toolkit and benchmarks intended to guide the development of higher-fidelity MILP generators and enable rigorous comparisons across approaches and solvers, with practical implications for downstream ML tasks and solver testing.
Abstract
Machine learning-based methods for Mixed-Integer Linear Programming (MILP) instance generation have proliferated, driven by the need for diverse training datasets. However, a critical question remains: are these generated instances truly useful and realistic? Current evaluation protocols often rely on superficial structural metrics or simple solvability checks, which frequently fail to capture the true computational complexity of real-world problems. To bridge this gap, we introduce GenBench-MILP, a comprehensive benchmark suite designed for the standardized and objective evaluation of MILP generators. Our framework assesses instance quality across four key dimensions: mathematical validity, structural similarity, computational hardness, and utility in downstream tasks. A distinctive innovation of GenBench-MILP is the analysis of solver-internal features -- including root node gaps, heuristic success rates, and cutting plane usage. By treating the solver's dynamic behavior as an expert assessment, we reveal nuanced computational discrepancies that static graph features miss. Our experiments on generative models for MILP instances demonstrate that instances with high structural similarity scores can still exhibit drastically divergent solver interactions and difficulty levels. By providing this multifaceted evaluation toolkit, GenBench-MILP aims to facilitate rigorous comparisons and guide the development of high-fidelity instance generators.
