Heuristics for Combinatorial Optimization via Value-based Reinforcement Learning: A Unified Framework and Analysis

Orit Davidovich, Shimrit Shtern, Segev Wasserkrug, Nimrod Megiddo

TL;DR

The paper addresses the lack of theory behind using value-based reinforcement learning (RL) to solve combinatorial optimization (CO) problems by proposing a unified translation of CO problems into undiscounted Markov decision processes (MDPs) via Karp-Held theory. It develops convergence and error-analysis results for projected and fitted value iteration within this CO-MDP framework, yielding explicit CO-optimality guarantees from ε-approximations of $V^*$. The work applies the framework to canonical problems such as the Traveling Salesman Problem and the Shortest Path Problem, and discusses sample-average approximation and estimation procedures, providing guidance on state embeddings and algorithm design. The result is a rigorous foundation for RL-based CO solvers that also informs practical choices such as approximation schemes, contraction conditions, and problem-specific state-space constructions.

Abstract

Since the 1990s, considerable empirical work has been carried out to train statistical models, such as neural networks (NNs), as learned heuristics for combinatorial optimization (CO) problems. When successful, such an approach eliminates the need for experts to design heuristics per problem type. Due to their structure, many hard CO problems are amenable to treatment through reinforcement learning (RL). Indeed, we find a wealth of literature training NNs using value-based, policy gradient, or actor-critic approaches, with promising results, both in terms of empirical optimality gaps and inference runtimes. Nevertheless, there has been a paucity of theoretical work undergirding the use of RL for CO problems. To this end, we introduce a unified framework to model CO problems through Markov decision processes (MDPs) and solve them using RL techniques. We provide easy-to-test assumptions under which CO problems can be formulated as equivalent undiscounted MDPs that provide optimal solutions to the original CO problems. Moreover, we establish conditions under which value-based RL techniques converge to approximate solutions of the CO problem with a guarantee on the associated optimality gap. Our convergence analysis provides: (1) a sufficient rate of increase in batch size and projected gradient descent steps at each RL iteration; (2) the resulting optimality gap in terms of problem parameters and targeted RL accuracy; and (3) the importance of a choice of state-space embedding. Together, our analysis illuminates the success (and limitations) of the celebrated deep Q-learning algorithm in this problem context.
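To make the CO-to-MDP idea concrete, the following is a minimal sketch, not the paper's construction. It assumes a tiny knapsack instance (the weights, values, and capacity are hypothetical), models it as a deterministic, undiscounted MDP whose state is (next item index, remaining capacity), computes the optimal value function by exact Bellman recursion, and then decodes a solution greedily with respect to immediate reward plus value of the next state, which is what a beam search of width 1 does. All function names are illustrative.

```python
from functools import lru_cache

# Hypothetical toy knapsack instance (not taken from the paper).
WEIGHTS = (2, 3, 4, 5)
VALUES = (3, 4, 5, 6)
CAPACITY = 5


def feasible_actions(i, cap):
    """Actions at state (i, cap): 0 = skip item i, 1 = take it if it fits."""
    return (0, 1) if WEIGHTS[i] <= cap else (0,)


def step(i, cap, a):
    """Deterministic transition; returns (next_state, immediate_reward)."""
    return (i + 1, cap - a * WEIGHTS[i]), a * VALUES[i]


@lru_cache(maxsize=None)
def v_star(i, cap):
    """Optimal undiscounted value via the Bellman recursion (terminal value 0)."""
    if i == len(WEIGHTS):
        return 0
    best = float("-inf")
    for a in feasible_actions(i, cap):
        (j, c), r = step(i, cap, a)
        best = max(best, r + v_star(j, c))
    return best


def greedy_rollout(value_fn):
    """Width-1 beam search: at each state pick argmax of r + V(next state)."""
    i, cap, total, chosen = 0, CAPACITY, 0, []
    while i < len(WEIGHTS):
        def score(a, i=i, cap=cap):
            (j, c), r = step(i, cap, a)
            return r + value_fn(j, c)

        a = max(feasible_actions(i, cap), key=score)
        if a == 1:
            chosen.append(i)
        (i, cap), r = step(i, cap, a)
        total += r
    return total, chosen


if __name__ == "__main__":
    print("V*(initial state):", v_star(0, CAPACITY))
    print("greedy rollout (total value, chosen items):", greedy_rollout(v_star))
```

With the exact value function, the greedy rollout attains the optimal objective value. Passing an approximate `value_fn` instead gives a simple way to experiment with how value-function error affects the resulting optimality gap, which is the kind of question the paper's analysis treats formally.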

Paper Structure

This paper contains 34 sections, 35 theorems, 187 equations, 5 figures, 2 tables, 2 algorithms.

Key Result

Theorem 2.1

There exists an optimal value function $V^*(\cdot ; c): \mathcal{A}^* / \sim \rightarrow \mathbb{R}$ and an immediate reward function $r(\cdot,\cdot; c) : \mathcal{A}^* / \sim \times \mathcal{A} \rightarrow \mathbb{R}$ such that the following will generate an optimal solution using Beam Search (Reddy1977Beam, Vaswani2017Attention, Janner2021RL) of width $B=1$ for

Figures (5)

  • Figure 1: Paper overview
  • Figure 2: States and action-labeled transitions for the KSP \ref{eq:ksp_example}. Some actions in $\mathcal{A}=\langle 5 \rangle$ leading to $s_\infty$ were omitted for clarity.
  • Figure 3: States and action-labeled transitions for the TSP with $d=3$.
  • Figure 4: States and action-labeled transitions for the SPP with $d=3$ ($\mathcal{A}=\{0,1,2,3\}$, $v_\mathrm{src}=0$, $v_\mathrm{tgt}=3$). Some action labels were omitted for clarity.
  • Figure 5: Each panel corresponds to a single $(\mathrm{COP},d)$ instance and summarizes all contractive PVI runs ($\gamma<1$). Shown is the histogram of $\mathrm{Res}$ \ref{eq:empirical_slack} derived from Proposition \ref{prop:pvi}; the $y$-axis reports relative frequency. The lower $5\%$ of values (right tail) were trimmed prior to plotting. Group sizes for each setting appear in Table \ref{tab:experiment_optimality_gap}. TSP12 statistics are derived from 30 COPs per $K$.

Theorems & Definitions (68)

  • Theorem 2.1: Informal
  • Definition 3.1
  • Definition 3.2
  • Example 3.1: Knapsack Problem
  • Definition 3.3
  • Definition 3.4
  • Theorem 3.5: Karp1967Programming, Theorem 3
  • Example 3.1 (KSP continued)
  • Proposition 3.5
  • Remark 3.6
  • ...and 58 more