OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling
Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, Zaiwen Wen
TL;DR
OptMATH tackles data scarcity in optimization modeling by generating a large, labeled dataset of triplets, each pairing a natural language description (NL) with a mathematical formulation (MF) and problem data (PD). The triplets are produced by a closed-loop pipeline of back-translation and forward modeling, and validated by rejection sampling. AutoFormulator, a model fine-tuned with LoRA on OptMATH-Train, translates natural language problem descriptions into both general formulations and solver code. The authors report state-of-the-art performance on NL4OPT, MAMO EasyLP, and OptMATH-Bench across 0.5B–32B models, and introduce OptMATH-Bench as a challenging long-context benchmark. The dataset and pipeline enable scalable, domain-adaptive optimization modeling and lay groundwork for integrating optimization with advanced AI techniques.
Abstract
Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. A back-translation step is then employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. A collection of rejected pairs is then identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances whose lengths are much longer than those of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B–32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach. Our dataset is publicly available at https://github.com/AuroraLHL/OptMATH.
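
The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the four callables (`generate_pd`, `back_translate`, `forward_solve`, `reference_solve`) are hypothetical placeholders for the instance generator, the back-translation LLM, the forward-modeling LLM plus solver, and a reference solver. The acceptance test, which compares optimal objective values within a relative tolerance, is also an assumption, since the abstract does not state the exact criterion.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Triplet:
    nl: str    # natural language description (NL)
    mf: str    # mathematical formulation (MF) taken from the seed data
    pd: dict   # problem data (PD) instantiating the formulation


def synthesize(
    seed_mf: str,
    generate_pd: Callable[[str, int], dict],          # hypothetical: MF + index -> PD
    back_translate: Callable[[str, dict], str],       # hypothetical: MF + PD -> NL
    forward_solve: Callable[[str], Optional[float]],  # hypothetical: NL -> optimal value
    reference_solve: Callable[[dict], float],         # hypothetical: PD -> optimal value
    n_samples: int,
    tol: float = 1e-6,
) -> Tuple[List[Triplet], List[Triplet]]:
    """Split generated triplets into accepted and rejected sets."""
    accepted: List[Triplet] = []
    rejected: List[Triplet] = []
    for i in range(n_samples):
        pd = generate_pd(seed_mf, i)       # generate problem data from the seed
        nl = back_translate(seed_mf, pd)   # back-translation: obtain NL
        expected = reference_solve(pd)     # ground-truth optimal value
        obtained = forward_solve(nl)       # forward modeling from NL alone
        triplet = Triplet(nl=nl, mf=seed_mf, pd=pd)
        # Assumed acceptance rule: optimal values agree within a relative tolerance.
        matches = (
            obtained is not None
            and abs(obtained - expected) <= tol * max(1.0, abs(expected))
        )
        (accepted if matches else rejected).append(triplet)
    return accepted, rejected
```

In this sketch the accepted list plays the role of the training split, while the rejected list is the pool from which benchmark candidates would be further filtered.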
