Learning to Select Goals in Automated Planning with Deep-Q Learning

Carlos Núñez-Molina, Juan Fernández-Olivares, Raúl Pérez

TL;DR

This paper addresses automated planning under real-time constraints by integrating a subgoal-selection mechanism, learned with Deep Q-Learning, into a planning and acting agent. The authors formulate goal selection as a deterministic MDP (M^g) and use a CNN to predict the remaining plan length for each candidate subgoal; the selected subgoal is then handed to a standard PDDL planner, which only has to plan as far as that subgoal. Empirical results show that the approach (DQP) is more sample-efficient than vanilla Deep Q-Learning, generalizes across GVGAI Boulder Dash levels, and reduces planning time compared to a state-of-the-art planner, solving all test levels within about 2 seconds at the cost of plans with about 9% more actions. These findings support combining deliberative planning with learned subgoal selection to obtain fast, generalizable behavior in real-time environments, with potential extensions to uncertain or dynamic settings.
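The selection step summarized above can be illustrated with a minimal sketch. It assumes that the learned predictor returns an estimated remaining plan length for a (state, subgoal) pair and that the agent greedily picks the candidate with the smallest estimate. The names `select_subgoal` and `toy_predictor` are hypothetical, and the Manhattan-distance predictor is only a stand-in for the trained CNN, not the authors' model.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical representation: states and subgoals are (row, col) grid cells.
Cell = Tuple[int, int]


def select_subgoal(
    state: Cell,
    candidates: Iterable[Cell],
    predict_length: Callable[[Cell, Cell], float],
) -> Cell:
    """Return the candidate subgoal with the smallest predicted plan length.

    Assumption: lower predictions are better, because the predictor
    estimates how many actions remain if this subgoal is pursued next.
    """
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no candidate subgoals to choose from")
    return min(candidates, key=lambda goal: predict_length(state, goal))


def toy_predictor(state: Cell, goal: Cell) -> float:
    """Stand-in for the learned Q-network: Manhattan distance on the grid."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])


# Usage: the chosen subgoal would then be passed to a planner as its goal.
best = select_subgoal((0, 0), [(5, 5), (1, 2), (4, 0)], toy_predictor)
print(best)
```

In this sketch the planner only needs to find a plan from the current state to `best`, which is the source of the reduced planning load described above.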

Abstract

In this work we propose a planning and acting architecture endowed with a module which learns to select subgoals with Deep Q-Learning. This allows us to decrease the load of a planner when faced with scenarios with real-time restrictions. We have trained this architecture on a video game environment used as a standard test-bed for intelligent systems applications, testing it on different levels of the same game to evaluate its generalization abilities. We have measured the performance of our approach as more training data is made available, as well as compared it with both a state-of-the-art, classical planner and the standard Deep Q-Learning algorithm. The results obtained show our model performs better than the alternative methods considered, when both plan quality (plan length) and time requirements are taken into account. On the one hand, it is more sample-efficient than standard Deep Q-Learning, and it is able to generalize better across levels. On the other hand, it reduces problem-solving time when compared with a state-of-the-art automated planner, at the expense of obtaining plans with only 9% more actions.

Paper Structure (17 sections, 2 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: A level of the Boulder Dash game.
  • Figure 2: An overview of the planning and acting architecture.
  • Figure 3: CNN architecture of the Goal Selection Module. This diagram shows how the size of the one-hot tensor changes as it passes through the layers of the network. The CNN receives an input of size $(30,30,7)$ corresponding to the one-hot tensor of a given $(s,g)$ pair and outputs a single prediction which represents the Q-value $Q(s,g)$.
  • Figure 4: Plan quality of the DQP model for different dataset sizes. This plot shows the average action coefficient (lower is better) of the DQP model as the number of training levels is increased. Each error bar represents an interval of $\pm1$ standard deviation.
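The Figure 3 caption states that the network input is a one-hot tensor of size $(30,30,7)$ built from an $(s,g)$ pair. The sketch below shows one way such a tensor could be assembled. The channel layout is an assumption made for illustration (six object-type channels plus one final channel marking the subgoal cell); the caption does not specify how the seven channels are allocated, and `encode_state_goal` is a hypothetical helper.

```python
HEIGHT, WIDTH, CHANNELS = 30, 30, 7  # input size given in the Figure 3 caption
GOAL_CHANNEL = CHANNELS - 1          # assumption: last channel marks the subgoal


def encode_state_goal(objects, goal):
    """Build a HEIGHT x WIDTH x CHANNELS one-hot tensor as nested lists.

    objects: dict mapping (row, col) -> object-type index in [0, GOAL_CHANNEL)
    goal:    (row, col) cell of the candidate subgoal
    """
    tensor = [[[0] * CHANNELS for _ in range(WIDTH)] for _ in range(HEIGHT)]

    def check_cell(row, col):
        if not (0 <= row < HEIGHT and 0 <= col < WIDTH):
            raise ValueError(f"cell ({row}, {col}) is outside the grid")

    # One channel per object type: set a 1 where that object is present.
    for (row, col), kind in objects.items():
        check_cell(row, col)
        if not 0 <= kind < GOAL_CHANNEL:
            raise ValueError(f"unknown object type {kind}")
        tensor[row][col][kind] = 1

    # Mark the subgoal cell in its dedicated channel.
    goal_row, goal_col = goal
    check_cell(goal_row, goal_col)
    tensor[goal_row][goal_col][GOAL_CHANNEL] = 1
    return tensor


# Usage: two objects and one subgoal on an otherwise empty level.
tensor = encode_state_goal({(0, 0): 0, (2, 3): 4}, goal=(5, 5))
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

A tensor of this shape would be fed to the CNN once per candidate subgoal, yielding one scalar $Q(s,g)$ per candidate.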