Reinforcement Learning-based Heuristics to Guide Domain-Independent Dynamic Programming
Minori Narita, Ryo Kuroiwa, J. Christopher Beck
TL;DR
This work enhances Domain-Independent Dynamic Programming (DIDP) by using reinforcement learning (RL) to guide the search. It formalizes a mapping from a DIDP model to an RL Markov decision process (MDP) and develops two guidance strategies: value-based guidance using Deep Q-Networks (DQN) and policy-based guidance using Proximal Policy Optimization (PPO), with state representations based on graph attention networks and related architectures. Empirical results on TSP, TSPTW, 0–1 Knapsack, and Portfolio Optimization show that PPO-guided DIDP often yields the smallest optimality gaps for a given number of node expansions and can outperform both the dual-bound baselines and problem-specific heuristics. Evaluating each node costs more with RL guidance, yet RL-guided DIDP still achieves better run-time performance in several domains. The findings point to a natural and effective synergy between DP and RL that can strengthen exact solvers, and they motivate future work on automating RL-model construction and reducing state-evaluation overhead.
Abstract
Domain-Independent Dynamic Programming (DIDP) is a state-space search paradigm based on dynamic programming for combinatorial optimization. In its current implementation, DIDP guides the search using user-defined dual bounds. Reinforcement learning (RL) is increasingly being applied to combinatorial optimization problems and shares several key structures with DP: both are formulated with the Bellman equation and state-based transition systems. We propose using reinforcement learning to obtain a heuristic function to guide the search in DIDP. We develop two RL-based guidance approaches: value-based guidance using Deep Q-Networks and policy-based guidance using Proximal Policy Optimization. Our experiments indicate that RL-based guidance significantly outperforms standard DIDP and problem-specific greedy heuristics for the same number of node expansions. Further, despite longer node evaluation times, RL guidance achieves better run-time performance than standard DIDP on three of four benchmark domains.
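
To make the idea of a heuristic function guiding the search concrete, the following is a minimal sketch and not the authors' implementation or the DIDP API. It runs a best-first search over the states of a 0–1 Knapsack DP model, ordering nodes by accumulated value plus a heuristic estimate of the remaining value. All names (`guided_search`, `make_optimistic`, `max_expansions`) are hypothetical, and the hand-written `make_optimistic` estimate is only a stand-in for the place where a trained value network would be queried.

```python
import heapq
from typing import Callable, List, Tuple

State = Tuple[int, int]  # (index of next item to decide, remaining capacity)


def guided_search(
    weights: List[int],
    values: List[int],
    capacity: int,
    heuristic: Callable[[State], float],
    max_expansions: int = 10_000,
) -> Tuple[int, int]:
    """Best-first search over knapsack DP states, ordered by g + h (maximization).

    `heuristic` is a placeholder (assumption) for a learned value estimate.
    Returns the best total value found and the number of node expansions.
    """
    start: State = (0, capacity)
    best_g = {start: 0}  # best accumulated value known for each state
    # heapq is a min-heap, so priorities are negated to pop the largest g + h.
    frontier = [(-heuristic(start), 0, start)]
    best, expansions = 0, 0

    while frontier and expansions < max_expansions:
        _, g, (i, cap) = heapq.heappop(frontier)
        if g < best_g[(i, cap)]:
            continue  # stale entry: the state was reached with a higher value
        expansions += 1
        if i == len(weights):  # all items decided: a complete solution
            best = max(best, g)
            continue
        # Two transitions of the DP model: skip item i, or take it if it fits.
        successors = [((i + 1, cap), g)]
        if weights[i] <= cap:
            successors.append(((i + 1, cap - weights[i]), g + values[i]))
        for state, g2 in successors:
            if g2 > best_g.get(state, -1):
                best_g[state] = g2
                heapq.heappush(frontier, (-(g2 + heuristic(state)), g2, state))
    return best, expansions


def make_optimistic(weights: List[int], values: List[int]) -> Callable[[State], float]:
    """Stand-in heuristic: total value of remaining items that fit individually."""
    def h(state: State) -> float:
        i, cap = state
        return sum(v for w, v in zip(weights[i:], values[i:]) if w <= cap)
    return h


if __name__ == "__main__":
    w, v = [2, 3, 4, 5], [3, 4, 5, 6]
    print(guided_search(w, v, 5, make_optimistic(w, v)))
```

Passing the heuristic as a plain callable keeps the search loop independent of how the estimate is produced, so a dual bound, a greedy rule, or a learned model could be swapped in without changing the search. The `max_expansions` budget reflects the comparison by node expansions described above: with a small budget, the quality of the heuristic determines how good the best solution found is.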
