OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning

Zezhen Ding, Zhen Tan, Jiheng Zhang, Tianlong Chen

TL;DR

OR-R1 addresses two weaknesses of automated OR modeling, heavy data requirements and inconsistent outputs, by combining a data-efficient SFT stage with Test-Time Group Relative Policy Optimization (TGRPO), which learns from unlabeled data. Its composite reward (Format, Valid-Code, and Majority Voting) steers the model toward structured, executable, and consistent solutions. The method achieves an average solving accuracy of $67.7\%$ using only a tenth of the synthetic data of prior work, and narrows the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) accuracy to $7\%$ across eight real-world benchmarks. This yields a scalable, cost-effective pathway for industrial OR automation with significantly reduced labeling and data requirements.

Abstract

Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise-intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, as most require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. It then improves capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of $67.7\%$, using only $1/10$ of the synthetic data required by prior methods such as ORLM, and exceeding ORLM's solving accuracy by up to $4.2\%$. Remarkably, OR-R1 outperforms ORLM by over $2.4\%$ with just $100$ synthetic samples. Furthermore, TGRPO contributes an additional $3.1\%$ to $6.4\%$ improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from $13\%$ to $7\%$. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.
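
To make the TGRPO idea concrete, the following is a minimal Python sketch of how a composite reward (Format, Valid-Code, Majority Voting) and a group-relative advantage could be computed for one group of sampled solutions on an unlabeled problem. It is an illustration under stated assumptions, not the authors' implementation: the function names, the equal reward weights, the additive combination, the rounding used to compare objective values, and the mean/standard-deviation normalization (borrowed from standard GRPO) are all hypothetical choices that this summary does not specify.

```python
from collections import Counter
from statistics import mean, pstdev


def _key(value, ndigits=6):
    """Round an objective value so near-identical floats vote together (assumed tolerance)."""
    return round(value, ndigits)


def majority_vote(results):
    """Return the most common executed result, or None if every run failed.

    `results` holds one objective value per sampled solution; None marks
    code that failed to execute.
    """
    valid = [_key(r) for r in results if r is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def composite_rewards(format_ok, results, w_format=1.0, w_valid=1.0, w_vote=1.0):
    """Sum three reward terms per sample (weights are hypothetical).

    - Format: the output follows the expected structure.
    - Valid-Code: the generated solver code executed successfully.
    - Majority Voting: the result agrees with the group's majority result,
      which serves as a pseudo-label because no ground truth is available.
    """
    pseudo_label = majority_vote(results)
    rewards = []
    for ok, res in zip(format_ok, results):
        executed = res is not None
        agrees = executed and _key(res) == pseudo_label
        rewards.append(w_format * ok + w_valid * executed + w_vote * agrees)
    return rewards


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within the group, as in standard GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions for one problem: two agree, one fails, one disagrees.
format_ok = [True, True, False, True]
results = [42.0, 42.0, None, 17.5]
rewards = composite_rewards(format_ok, results)
advantages = group_relative_advantages(rewards)
```

In this toy group, the two samples that agree with the majority result receive the highest reward, the sample whose code fails receives the lowest, and the advantages are centered on zero, so the policy update favors outputs that are well formed, executable, and consistent with the group.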

Paper Structure

This paper contains 23 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of OR-R1.
  • Figure 2: The figure illustrates the training process of TGRPO. The blue section shows an example of an operations research (OR) problem. The green section presents sample outputs, including the mathematical model and corresponding code. The light yellow section displays the code execution results, while the yellow section shows the majority voting of execution results. The orange section represents the reward function, and the red section indicates the advantage function.
  • Figure 3: Training dynamics of the core OR-R1 reward components for SFT(3K)-TGRPO.
  • Figure 4: Performance of Pass@1 and Pass@8 during TGRPO Training. Pass@1 measures the accuracy when only the model's top prediction is considered, while Pass@8 reflects the probability that at least one out of the top 8 generated solutions is correct.
  • Figure 5: The impact of different data scales on TGRPO performance. In TGRPO(N), N denotes the number of samples randomly selected from each test set for TGRPO training. Notably, all models here were trained for 160 steps.
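
The Pass@1 and Pass@8 definitions given in the Figure 4 caption can be sketched in a few lines of Python. This is a hedged illustration only: the function names and the boolean outcome matrix are hypothetical, and the paper's exact evaluation procedure is not shown in this summary.

```python
def pass_at_1(correct_per_problem):
    """Fraction of problems whose top prediction (sample 0) is correct."""
    return sum(samples[0] for samples in correct_per_problem) / len(correct_per_problem)


def pass_at_k(correct_per_problem, k):
    """Fraction of problems where at least one of the first k samples is correct."""
    return sum(any(samples[:k]) for samples in correct_per_problem) / len(correct_per_problem)


# Hypothetical outcomes: one row per problem, one flag per sampled solution.
outcomes = [
    [True, False],
    [False, True],
    [False, False],
    [True, True],
]
single = pass_at_1(outcomes)     # 2 of 4 top predictions are correct
multi = pass_at_k(outcomes, 2)   # 3 of 4 problems have a correct sample
gap = multi - single
```

The difference between the two numbers is the single-attempt versus multi-attempt gap discussed in the abstract: a smaller gap means the top prediction is more often as good as the best of several samples.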