When Words Change the Model: Sensitivity of LLMs for Constraint Programming Modelling
Alessio Pellegrino, Jacopo Mauro
TL;DR
This paper investigates how robust large language models are at translating natural-language problem descriptions into constraint programming models. The authors systematically perturb CSPLib problems with context changes and distractions and evaluate three state-of-the-art LLMs in a zero-shot setting. The models perform strongly on the original descriptions but decline sharply when the wording shifts, raising concerns about data contamination and shallow contextual understanding. The study shows that explicit mathematical anchors improve robustness, while surface cues and distractions can mislead models, underscoring the need for careful prompt design and potential hybrid approaches that couple LLMs with formal solvers. These findings have practical implications for deploying LLM-based modelling tools in CP, suggesting directions for interactive refinement, prompt engineering, and solver-assisted pipelines to ensure reliable, verifiable models.
Abstract
One of the long-standing goals in optimisation and constraint programming is to describe a problem in natural language and automatically obtain an executable, efficient model. Large language models appear to bring this vision closer, showing impressive results in automatically generating models for classical benchmarks. However, much of this apparent success may derive from data contamination rather than genuine reasoning: many standard CP problems are likely included in the training data of these models. To examine this hypothesis, we systematically rephrased and perturbed a set of well-known CSPLib problems, preserving their structure while modifying their context and introducing misleading elements. We then compared the models produced by three representative LLMs across original and modified descriptions. Our qualitative analysis shows that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation, revealing shallow understanding and sensitivity to wording.
