DCP-Bench-Open: Evaluating LLMs for Constraint Modelling of Discrete Combinatorial Problems

Kostis Michailidis, Dimos Tsouros, Tias Guns

TL;DR

The paper tackles the bottleneck of constraint modelling for discrete combinatorial problems by introducing DCP-Bench-Open, a diverse benchmark of 164 real-world problems sourced from the Constraint Programming and Operations Research communities. It systematically evaluates state-of-the-art LLMs across three modelling frameworks (CPMpy, MiniZinc, and OR-Tools CP-SAT), comparing system prompt levels and inference-time compute methods: retrieval-augmented in-context learning, reasoning, repeated sampling, and self-verification. The key findings are that Python-based frameworks, especially CPMpy, yield higher modelling accuracy; that richer prompts and inference-time techniques raise accuracy to as much as 91% in the best configurations; and that multi-instance evaluation reveals generalization gaps. The benchmark and results show both the potential and the current limitations of LLM-assisted constraint modelling, offering guidance for practitioners and directions for future research in multi-instance prompting and framework-agnostic evaluation.

Abstract

Discrete Combinatorial Problems (DCPs) are prevalent in industrial decision-making and optimisation. However, while constraint solving technologies for DCPs have advanced significantly, the core process of formalising them, namely constraint modelling, requires significant expertise and remains a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for discrete constraint modelling are often limited to small, homogeneous, or domain-specific problems, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing DCP-Bench-Open, a novel benchmark that includes a diverse set of well-known discrete combinatorial problems sourced from the Constraint Programming (CP) and Operations Research (OR) communities, structured explicitly for evaluating LLM-driven constraint modelling. With this dataset, and given the variety of modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 91% on this highly challenging benchmark. DCP-Bench-Open is publicly available.
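To make "constraint modelling" concrete, here is a minimal sketch of what turning a problem description into an executable model involves. The toy problem (three tasks A, B, C, each assigned a distinct slot from 1 to 3, with A before B and C not in slot 1) is a hypothetical illustration and not one of the benchmark problems. The brute-force enumeration stands in for a real constraint solver and uses only the Python standard library; none of the three evaluated frameworks appears here.

```python
from itertools import product

# Decision variables and their domains (hypothetical toy problem).
domains = {"A": range(1, 4), "B": range(1, 4), "C": range(1, 4)}

# Constraints, written as predicates over a complete assignment.
constraints = [
    lambda s: len(set(s.values())) == len(s),  # all slots are different
    lambda s: s["A"] < s["B"],                 # A is scheduled before B
    lambda s: s["C"] != 1,                     # C may not use slot 1
]

def solve(domains, constraints):
    """Enumerate every assignment and yield those satisfying all constraints.

    A stand-in for a constraint solver: exhaustive search, fine only for
    tiny domains.
    """
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(check(assignment) for check in constraints):
            yield assignment

solutions = list(solve(domains, constraints))
for s in solutions:
    print(s)
```

The modelling step is everything above `solve`: choosing the variables, their domains, and the constraints. That is the part the benchmark asks an LLM to produce from a natural-language description; the solving step is delegated to existing technology.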

Paper Structure

This paper contains 32 sections, 6 equations, 34 figures, 5 tables, 1 algorithm.

Figures (34)

  • Figure 1: LLM-driven constraint modelling: users state a combinatorial problem in natural language, which the system transforms into a formal constraint representation, and delegates the latter to a constraint solver.
  • Figure 2: Retrieval-augmented few-shot prompting: on-the-fly example retrieval to provide more relevant patterns to the LLM.
  • Figure 3: Iterative Self-Verification: the generated CP model is verified iteratively; both the original problem description and produced solution are also provided back to the LLM. In this work, we use a single LLM for both model generation and verification (thus, self-verification).
  • Figure 4: Percentages of successfully generated models (Single Instance Accuracy). From top to bottom: MiniZinc, CPMpy, OR-Tools.
  • Figure 5: Average Single Instance Accuracy by Framework and System Prompt Level (Aggregated across LLMs).
  • ...and 29 more figures
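
The iterative self-verification loop described in the Figure 3 caption can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the function name `self_verify_loop`, the three callables (`generate`, `run_model`, `verify`), the feedback string, and the `max_rounds` limit are all hypothetical. The stubs at the bottom stand in for the LLM and the solver so that the control flow can be run without either.

```python
from typing import Callable, Optional, Tuple

def self_verify_loop(
    description: str,
    generate: Callable[[str, Optional[str]], str],
    run_model: Callable[[str], Optional[dict]],
    verify: Callable[[str, str, Optional[dict]], Tuple[bool, str]],
    max_rounds: int = 3,  # assumed limit; not taken from the paper
) -> Tuple[str, Optional[dict], int]:
    """Generate a model, run it, and ask the verifier whether to accept it.

    Returns the last model, its solution, and the number of rounds used.
    """
    feedback: Optional[str] = None
    model_code, solution = "", None
    for round_no in range(1, max_rounds + 1):
        model_code = generate(description, feedback)
        solution = run_model(model_code)
        # The verifier sees the description, the model, and the solution.
        accepted, feedback = verify(description, model_code, solution)
        if accepted:
            return model_code, solution, round_no
    return model_code, solution, max_rounds

# Stubs standing in for the LLM and the solver (hypothetical behaviour).
def fake_generate(description, feedback):
    return "x < 5" if feedback is None else "x <= 5"

def fake_run(model_code):
    return {"x": 4} if model_code == "x < 5" else {"x": 5}

def fake_verify(description, model_code, solution):
    ok = solution == {"x": 5}
    return ok, ("" if ok else "x = 5 should be allowed")

result = self_verify_loop(
    "Maximise x subject to x being at most 5",
    fake_generate, fake_run, fake_verify,
)
print(result)
```

In the setting the caption describes, a single LLM would play both the `generate` and `verify` roles; the sketch keeps them as separate callables only to make the data flow explicit.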