Can Large Language Models Reason and Optimize Under Constraints?

Fabien Bernier; Salah Ghamizi; Pantelis Dogoulis; Maxime Cordy

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities across diverse natural language tasks, yet their ability to solve abstraction and optimization problems with constraints remains largely unexplored. In this paper, we investigate whether LLMs can reason and optimize under the physical and operational constraints of the Optimal Power Flow (OPF) problem. We introduce a challenging evaluation setup that requires a set of fundamental skills: reasoning, structured input handling, arithmetic, and constrained optimization. Our evaluation reveals that state-of-the-art LLMs fail on most of the tasks, and that reasoning LLMs still fail in the most complex settings. Our findings highlight critical gaps in LLMs' ability to handle structured reasoning under constraints, and this work provides a rigorous testing environment for developing more capable LLM assistants that can tackle real-world power grid optimization problems.

Paper Structure (35 sections, 2 equations, 5 figures, 4 tables)

Introduction
Related work
LLMs for Power Systems
LLM Reasoning Capabilities
LLMs on Constraint Satisfaction Tasks
Method
Problem Formulation of Optimization with Reasoning
Abstraction.
Mathematics.
Multi-Step optimization.
Power Grid Optimization task
Dataset.
Mean Squared Error.
Evaluated Constraints.
Experimental Protocol
...and 20 more sections

Figures (5)

Figure 1: Overview of our study. LLMs are evaluated first with simple in-context learning, and then after supervised fine-tuning (for non-reasoning models) or group relative policy optimization (GRPO, for reasoning models). All model outputs are evaluated with the same metrics: MSE, constraint satisfaction, and structure validity.
Figure 2: Performance of vanilla models (before fine-tuning) on cases 14, 30 and 118.
Figure 3: Performance of vanilla models (before fine-tuning) and of fine-tuned models on N and N-1 cases.
Figure 4: Performance of reasoning models, before GRPO (vanilla) and after GRPO, on N and N-1 scenarios.
Figure 5: Prompt used for the OPF task assessment.
