DCP-Bench-Open: Evaluating LLMs for Constraint Modelling of Discrete Combinatorial Problems
Kostis Michailidis, Dimos Tsouros, Tias Guns
TL;DR
The paper tackles the constraint-modelling bottleneck for discrete combinatorial problems by introducing DCP-Bench-Open, a diverse benchmark of 164 real-world problems sourced from the Constraint Programming and Operations Research communities. It systematically evaluates state-of-the-art LLMs across three modelling frameworks (CPMpy, MiniZinc, and OR-Tools CP-SAT), comparing prompt levels and inference-time compute methods, including retrieval-augmented in-context learning, reasoning, repeated sampling, and self-verification. Python-based frameworks, especially CPMpy, yield higher modelling accuracy. Richer prompts and inference-time techniques raise accuracy to as much as 91% in the best configurations, while multi-instance evaluation reveals generalization gaps. The benchmark and results show both the potential and the current limitations of LLM-assisted constraint modelling, offering guidance for practitioners and pointing to multi-instance prompting and framework-agnostic evaluation as directions for future research.
Abstract
Discrete Combinatorial Problems (DCPs) are prevalent in industrial decision-making and optimisation. However, while constraint solving technologies for DCPs have advanced significantly, the core process of formalising them, namely constraint modelling, requires significant expertise and remains a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for discrete constraint modelling are often limited to small, homogeneous, or domain-specific problems, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing DCP-Bench-Open, a novel benchmark that includes a diverse set of well-known discrete combinatorial problems sourced from the Constraint Programming (CP) and Operations Research (OR) communities, structured explicitly for evaluating LLM-driven constraint modelling. With this dataset, and given the variety of modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 91% on this highly challenging benchmark. DCP-Bench-Open is publicly available.
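Repeated sampling, one of the inference-time methods named in the TL;DR, can be illustrated with a small sketch. The version below assumes that several candidate models have already been generated for one problem, that each can be executed to produce a hashable result, and that the final answer is chosen by majority vote over those results. The voting rule, the `majority_vote` helper, and the stand-in `model_*` functions are all hypothetical illustrations, not the selection procedure used in the paper.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(
    candidates: Sequence[Callable[[], Hashable]],
) -> Optional[Hashable]:
    """Run each candidate model and return the most common result.

    Candidates that raise an exception (e.g. generated code that fails
    to execute) are skipped. Returns None if no candidate succeeds.
    """
    results = []
    for run_model in candidates:
        try:
            results.append(run_model())
        except Exception:
            continue  # discard candidates that crash
    if not results:
        return None
    # most_common(1) returns [(value, count)]; ties go to the first-seen value
    return Counter(results).most_common(1)[0][0]


# Hypothetical stand-ins for four sampled models of the same problem.
def model_a():
    return 42


def model_b():
    return 42


def model_c():
    raise RuntimeError("runtime error in generated code")


def model_d():
    return 17


best = majority_vote([model_a, model_b, model_c, model_d])
print(best)
```

In practice each callable would wrap the execution of one LLM-generated model in the chosen framework. Voting on the produced results rather than on the source text lets syntactically different but equivalent models agree with one another.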
