
Navigation with QPHIL: Quantizing Planner for Hierarchical Implicit Q-Learning

Alexi Canesse, Mathieu Petitbois, Ludovic Denoyer, Sylvain Lamprier, Rémy Portelas

TL;DR

This work presents a novel hierarchical transformer-based approach that leverages a learned quantizer of the state space to tackle long-horizon navigation tasks, and it achieves state-of-the-art results in complex long-distance navigation environments.

Abstract

Offline Reinforcement Learning (RL) has emerged as a powerful alternative to imitation learning for behavior modeling in various domains, particularly in complex navigation tasks. An existing challenge with offline RL is the signal-to-noise ratio, i.e., how to mitigate incorrect policy updates caused by errors in value estimates. To address this, multiple works have demonstrated the advantage of hierarchical offline RL methods, which decouple high-level path planning from low-level path following. In this work, we present a novel hierarchical transformer-based approach leveraging a learned quantizer of the state space. This quantization enables the training of a simpler zone-conditioned low-level policy and simplifies planning, which is reduced to discrete autoregressive prediction. Among other benefits, zone-level reasoning in planning enables explicit trajectory stitching rather than implicit stitching based on noisy value function estimates. By combining this transformer-based planner with recent advancements in offline RL, our proposed approach achieves state-of-the-art results in complex long-distance navigation environments.

Paper Structure

This paper contains 41 sections, 16 equations, 15 figures, 7 tables, 2 algorithms.

Figures (15)

  • Figure 1: QPHIL relies on learning discrete landmarks (zones) to reduce navigation to high-level landmark sequence generation and low-level landmark-conditioned path-following.
  • Figure 2: Motivations behind QPHIL. (a) QPHIL aims to simplify subgoal planning by leveraging discrete tokens. (b) By doing so, QPHIL avoids noisy high-frequency target subgoal updates: the subgoal of the low-level policy is updated only after each landmark traversal. (c) The subgoal-reaching tasks are less demanding in conditioning for the low-level policy, as each corresponds to reaching an entire subzone instead of a precise subgoal.
  • Figure 3: Inference pipeline of QPHIL (open-loop version, without replanning). Subgoal tokens are consumed from the sequence after each corresponding landmark is reached.
  • Figure 4: Tokenization example. Each token has an associated color. The color of each point in the background corresponds to the color of the associated token. Notably, the tokens align with the walls thanks to the contrastive loss.
  • Figure 5: Example of tokenizations using AntMaze environments. Each token has an associated color. The color of each point in the background corresponds to the color of the associated token.
  • ...and 10 more figures
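
The open-loop inference loop described in the Figure 3 caption (subgoal tokens are consumed once the corresponding landmark is reached) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a 2D state, replaces the learned quantizer with a hypothetical nearest-center lookup over a fixed `codebook`, takes the planner's token sequence as given in `plan`, and replaces the learned low-level policy with a toy rule that steps toward the center of the target zone. The names `quantize`, `follow_plan`, `codebook`, and `step_size` are introduced here for illustration only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def quantize(state: Point, codebook: Sequence[Point]) -> int:
    """Map a state to the index of its nearest codebook center.

    Assumption: a nearest-center lookup stands in for the learned quantizer.
    """
    return min(
        range(len(codebook)),
        key=lambda k: (state[0] - codebook[k][0]) ** 2
        + (state[1] - codebook[k][1]) ** 2,
    )


def follow_plan(
    start: Point,
    plan: Sequence[int],
    codebook: Sequence[Point],
    step_size: float = 0.25,
    max_steps: int = 1000,
) -> Tuple[Point, List[int], List[int]]:
    """Follow a fixed token plan open-loop (no replanning).

    Returns the final state, the tokens left unconsumed, and the
    sequence of zones visited along the way.
    """
    state = start
    remaining = list(plan)
    visited = [quantize(state, codebook)]
    for _ in range(max_steps):
        # Consume every leading token whose zone the agent is already in.
        while remaining and quantize(state, codebook) == remaining[0]:
            remaining.pop(0)
        if not remaining:
            break
        # Toy low-level policy (assumption): step toward the target zone center.
        target = codebook[remaining[0]]
        dx, dy = target[0] - state[0], target[1] - state[1]
        dist = (dx * dx + dy * dy) ** 0.5
        scale = min(1.0, step_size / dist) if dist > 0 else 0.0
        state = (state[0] + dx * scale, state[1] + dy * scale)
        zone = quantize(state, codebook)
        if zone != visited[-1]:
            visited.append(zone)
    return state, remaining, visited


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    final_state, remaining, visited = follow_plan((0.0, 0.0), [1, 2], codebook)
    print(final_state, remaining, visited)
```

In this sketch the low-level target changes only when a zone boundary is crossed, which mirrors the point made in Figure 2(b) about avoiding high-frequency subgoal updates.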