Evaluating LLM Reasoning in the Operations Research Domain with ORQA

Mahdi Mostajabdaveh, Timothy T. Yu, Samarendra Chandan Bindu Dash, Rindranirina Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, Yong Zhang

TL;DR

This work introduces ORQA, a dedicated benchmark to evaluate how well open-source LLMs generalize to Operations Research by solving real-world, multi-step optimization problems described in natural language. By assembling 1513 instances across 20 domains with expert-crafted questions and four-option answers, the authors systematically assess open-source LLMs under varied prompting strategies, including standard and Chain-of-Thought prompts. The results reveal modest performance overall, with the best model achieving about 0.772 accuracy (human expert ~0.93 on a subset), and show that CoT prompting frequently degrades performance in this domain, highlighting persistent challenges in domain-specific reasoning and data scarcity. ORQA thus provides a reproducible, domain-focused testbed to guide future improvements in LLM reasoning, knowledge integration, and potentially architecture choices beyond standard transformers.

Abstract

In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark designed to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when confronted with diverse and complex optimization problems. The dataset, developed by OR experts, features real-world optimization problems that demand multistep reasoning to construct their mathematical models. Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, highlighting a gap in their ability to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs' generalization capabilities, offering valuable insights for future research in this area. The dataset and evaluation code are publicly available.
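
The TL;DR describes the evaluation as four-option multiple-choice questions scored by accuracy under several prompting strategies. As a rough illustration only, and not the authors' released evaluation code, the sketch below shows how such scores could be tallied. The record layout `(strategy, predicted_index, correct_index)`, the strategy labels, the function names, and the toy data are all hypothetical assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_strategy(records):
    """Compute accuracy per prompting strategy.

    `records` is an iterable of (strategy, predicted_index, correct_index)
    tuples; this layout is an assumption for illustration, not the ORQA format.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for strategy, predicted, correct in records:
        totals[strategy] += 1
        hits[strategy] += int(predicted == correct)
    return {s: hits[s] / totals[s] for s in totals}


def mean_accuracy(per_strategy):
    """Average the per-strategy accuracies into a single score."""
    return sum(per_strategy.values()) / len(per_strategy)


# Toy data: option indices 0-3 stand for the four answer options.
records = [
    ("standard", 0, 0),  # correct
    ("standard", 2, 1),  # wrong
    ("cot", 3, 3),       # correct
    ("cot", 1, 1),       # correct
]

per_strategy = accuracy_by_strategy(records)
average = mean_accuracy(per_strategy)
print(per_strategy, average)
```

Keeping the per-strategy scores separate before averaging makes it possible to compare strategies directly, for instance to check whether Chain-of-Thought prompting helps or hurts on a given model.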

Paper Structure

This paper contains 25 sections, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Left: Dataset instance containing the description, question, options, and answer (bold). Right: Example reasoning steps needed to answer the simple question.
  • Figure 2: An example of optimization problem components, their relationships, and corresponding mathematical formulations.
  • Figure 3: Selection, creation, and verification process for the ORQA dataset.
  • Figure 4: The different components of a prompt. The pre-defined text is in black, the example we provide is in blue, and the LLM output is in red.
  • Figure 5: Heatmap of LLM performance on different question types. Performance is the average accuracy over the five prompting strategies.
  • ...and 4 more figures