Evaluating LLM Reasoning in the Operations Research Domain with ORQA
Mahdi Mostajabdaveh, Timothy T. Yu, Samarendra Chandan Bindu Dash, Rindranirina Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, Yong Zhang
TL;DR
This work introduces ORQA, a benchmark for evaluating how well open-source LLMs generalize to Operations Research (OR) by answering questions about real-world, multi-step optimization problems described in natural language. The benchmark comprises 1513 instances across 20 domains, each with an expert-crafted question and four answer options, and the authors use it to assess the models under several prompting strategies, including standard and Chain-of-Thought (CoT) prompts. Performance is modest overall: the best model reaches about 0.772 accuracy, compared with roughly 0.93 for a human expert on a subset. CoT prompting frequently degrades performance in this domain, which points to persistent challenges in domain-specific reasoning and data scarcity. ORQA thus provides a reproducible, domain-focused testbed to guide future improvements in LLM reasoning, knowledge integration, and potentially architecture choices beyond standard transformers.
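As a rough illustration of the accuracy figures quoted above, the following minimal sketch scores predictions on four-option multiple-choice instances. The `Instance` fields and the `accuracy` helper are hypothetical names chosen for this example; they are not the ORQA dataset schema or the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One multiple-choice item (hypothetical layout, not the ORQA schema)."""
    context: str        # natural-language description of the optimization problem
    question: str       # expert-written question about the problem
    options: list[str]  # the four candidate answers
    answer: int         # index (0-3) of the correct option


def accuracy(instances: list[Instance], predictions: list[int]) -> float:
    """Return the fraction of instances whose predicted option index is correct."""
    if not instances:
        raise ValueError("need at least one instance")
    if len(instances) != len(predictions):
        raise ValueError("one prediction is required per instance")
    correct = sum(
        1 for inst, pred in zip(instances, predictions) if pred == inst.answer
    )
    return correct / len(instances)
```

Under this scoring, a model that picks one of the four options uniformly at random would be expected to score about 0.25, which gives a baseline for reading the reported numbers.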
Abstract
In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark designed to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when confronted with diverse and complex optimization problems. The dataset, developed by OR experts, features real-world optimization problems that demand multistep reasoning to construct their mathematical models. Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, highlighting a gap in their ability to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs' generalization capabilities, offering valuable insights for future research in this area. The dataset and evaluation code are publicly available.
