DualSchool: How Reliable are LLMs for Optimization Education?
Michael Klamkin, Arnaud Deza, Sikai Cheng, Haoruo Zhao, Pascal Van Hentenryck
TL;DR
The paper asks whether large language models can reliably perform Primal-to-Dual Conversion (P2DC) for linear programs. It introduces DualSchool, a framework that automatically generates duals, injects errors, and verifies correctness with a convention-invariant metric called Canonical Graph Edit Distance (CGED). The results show that although LLMs can recite the P2DC procedure, they struggle to produce correct duals, even on small instances, highlighting a gap between procedural knowledge and reliable execution. The authors release a large dataset of primal-dual pairs and an automated evaluation pipeline that combines CGED with validity checks; the pipeline reveals that open LLMs have limited reliability on P2DC tasks. The work has educational and methodological implications, offering a benchmark and potential pathways (e.g., RL with symbolic feedback) to improve reasoning systems and AI-assisted optimization tutoring.
Abstract
Consider the following task, taught in introductory optimization courses, which addresses challenges articulated by the community at the intersection of (generative) AI and OR: generate the dual of a linear program. LLMs, being trained at web scale, have the conversion process and many instances of Primal-to-Dual Conversion (P2DC) at their disposal. Students may thus reasonably expect LLMs to perform well on the P2DC task. To assess this expectation, this paper introduces DualSchool, a comprehensive framework for generating and verifying P2DC instances. The verification procedure of DualSchool uses the Canonical Graph Edit Distance, going well beyond existing evaluation methods for optimization models, which exhibit many false positives and negatives when applied to P2DC. Experiments performed with DualSchool reveal interesting findings. Although LLMs can recite the conversion procedure accurately, state-of-the-art open LLMs fail to consistently produce correct duals. This finding holds even for the smallest two-variable instances and for derivative tasks, such as correctness verification and error classification. The paper also discusses the implications for educators, students, and the development of large reasoning systems.
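To make the P2DC task concrete, the following minimal sketch converts a linear program in one textbook form, min c^T x subject to Ax >= b and x >= 0, into its dual, max b^T y subject to A^T y <= c and y >= 0. The `LP` container and the `dual_of_canonical_min` function are hypothetical names introduced for illustration only; they are not part of DualSchool, and the sketch covers just this one form rather than the general conversion rules or the CGED verification described in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LP:
    """Hypothetical container for an LP with nonnegative variables."""
    sense: str                 # "min" or "max"
    objective: List[float]     # objective coefficients
    matrix: List[List[float]]  # constraint matrix, one row per constraint
    rhs: List[float]           # right-hand sides
    constraint_sense: str      # ">=" or "<=" (same for all rows)


def dual_of_canonical_min(primal: LP) -> LP:
    """Dualize  min c^T x  s.t.  A x >= b, x >= 0.

    The dual is  max b^T y  s.t.  A^T y <= c, y >= 0.
    Other primal forms are deliberately rejected in this sketch.
    """
    if primal.sense != "min" or primal.constraint_sense != ">=":
        raise ValueError("this sketch only handles min / '>=' primals")
    rows = len(primal.matrix)
    cols = len(primal.objective)
    # Transpose A: each primal variable becomes one dual constraint.
    transposed = [
        [primal.matrix[i][j] for i in range(rows)] for j in range(cols)
    ]
    return LP(
        sense="max",
        objective=list(primal.rhs),  # b becomes the dual objective
        matrix=transposed,
        rhs=list(primal.objective),  # c becomes the dual right-hand side
        constraint_sense="<=",
    )


# Two-variable example:
#   min 2 x1 + 3 x2   s.t.   x1 + 2 x2 >= 4,   3 x1 + x2 >= 6,   x >= 0
primal = LP("min", [2.0, 3.0], [[1.0, 2.0], [3.0, 1.0]], [4.0, 6.0], ">=")
dual = dual_of_canonical_min(primal)
```

For this instance the resulting dual reads: max 4 y1 + 6 y2 subject to y1 + 3 y2 <= 2, 2 y1 + y2 <= 3, y >= 0. Equivalent duals can be written under other sign and ordering conventions, which is why comparing a candidate against a reference dual requires a convention-invariant check rather than a literal match.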
