Reinforcement Learning-based Heuristics to Guide Domain-Independent Dynamic Programming

Minori Narita, Ryo Kuroiwa, J. Christopher Beck

TL;DR

This work enhances Domain-Independent Dynamic Programming (DIDP) by using reinforcement learning (RL) to guide the search. It formalizes a mapping from a DIDP model to an RL Markov decision process (MDP) and develops two guidance strategies: value-based guidance using Deep Q-Networks (DQN) and policy-based guidance using Proximal Policy Optimization (PPO), with state representations based on graph attention networks and related architectures. Empirical results on TSP, TSPTW, 0-1 Knapsack, and Portfolio Optimization show that PPO-guided DIDP often yields the smallest gaps for a given number of node expansions and can outperform both the dual-bound baselines and problem-specific heuristics. Although RL guidance has higher per-node evaluation costs, it still achieves better run-time than standard DIDP on three of the four domains. The findings point to a natural synergy between DP and RL that can enhance exact solvers, and to future work on automating RL-model construction and reducing state-evaluation overhead.

Abstract

Domain-Independent Dynamic Programming (DIDP) is a state-space search paradigm based on dynamic programming for combinatorial optimization. In its current implementation, DIDP guides the search using user-defined dual bounds. Reinforcement learning (RL) is increasingly being applied to combinatorial optimization problems and shares several key structures with DP: both are represented by the Bellman equation and state-based transition systems. We propose using reinforcement learning to obtain a heuristic function to guide the search in DIDP. We develop two RL-based guidance approaches: value-based guidance using Deep Q-Networks and policy-based guidance using Proximal Policy Optimization. Our experiments indicate that RL-based guidance significantly outperforms standard DIDP and problem-specific greedy heuristics with the same number of node expansions. Further, despite longer node evaluation times, RL guidance achieves better run-time performance than standard DIDP on three of four benchmark domains.

Paper Structure

This paper contains 32 sections, 17 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Mapping from a DIDP model to an RL model for maximization problems. State constraints $\mathcal{C}$ and forced$_\tau$ are not mapped to the RL model.
  • Figure 2: Value-based guidance and policy-based guidance for DIDP. The equations for computing $f$-values in the figure are for maximization problems.
  • Figure 3: Results of applying heuristics to guide DIDP, averaged over 40 instances (20 each for small and medium sizes). Small instances have $n=20$ and medium instances have $n=50$, except for 0-1 Knapsack ($n=50$ small, $n=100$ medium).
  • Figure 4: A rectangle used to show the dual bound based on the best profit-weight ratio. Each colored box $j$ represents an investment $j$ that has not yet been considered.