Vision-Language Navigation with Energy-Based Policy

Rui Liu, Wenguan Wang, Yi Yang

TL;DR

This paper proposes an Energy-based Navigation Policy (ENP) that models the joint state-action distribution with an energy-based model. ENP learns to align globally with the expert policy by jointly maximizing the likelihood of the expert's actions and modeling the dynamics of the navigation states.

Abstract

Vision-language navigation (VLN) requires an agent to execute actions following human instructions. Existing VLN models are optimized through expert demonstrations by supervised behavioural cloning or incorporating manual reward engineering. While straightforward, these efforts overlook the accumulation of errors in the Markov decision process, and struggle to match the distribution of the expert policy. Going beyond this, we propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution using an energy-based model. At each step, low energy values correspond to the state-action pairs that the expert is most likely to perform, and vice versa. Theoretically, the optimization objective is equivalent to minimizing the forward divergence between the occupancy measure of the expert and ours. Consequently, ENP learns to globally align with the expert policy by maximizing the likelihood of the actions and modeling the dynamics of the navigation states in a collaborative manner. With a variety of VLN architectures, ENP achieves promising performances on R2R, REVERIE, RxR, and R2R-CE, unleashing the power of existing VLN models.
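The relationship between energies and the policy described in the abstract can be illustrated with a small sketch. It assumes, as is common for energy-based classifiers, that the joint energy of a state-action pair is the negative action score and that the marginal state energy is the negative log-sum-exp of those scores; this parameterization and the names `joint_energy`, `marginal_energy`, and `action_probs` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def logsumexp(x):
    # Numerically stable log(sum(exp(x))) for a 1-D score vector.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def joint_energy(scores, action):
    # Assumed form: E(s, a) = -f(s)[a]; low energy means a likely expert action.
    return -scores[action]

def marginal_energy(scores):
    # Assumed form: E(s) = -log sum_a exp(f(s)[a]).
    return -logsumexp(scores)

def action_probs(scores):
    # pi(a | s) = exp(-E(s, a)) / exp(-E(s)), which is the usual softmax.
    return np.exp(scores - logsumexp(scores))

# Hypothetical scores for three candidate actions at one navigation step.
scores = np.array([2.0, 0.5, -1.0])
probs = action_probs(scores)
energies = np.array([joint_energy(scores, a) for a in range(len(scores))])
best_action = int(np.argmin(energies))
```

Under this assumed form, the action with the lowest joint energy is also the action with the highest conditional probability, so behavioural cloning on the conditional distribution and the energy view agree on the action ranking; the energy view additionally exposes a marginal state term that can be trained.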

Paper Structure

This paper contains 16 sections, 12 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of behavioural cloning (BC) and ENP for VLN. Previous methods use BC to optimize the conditional action distribution directly, whereas ENP models the joint state-action distribution through an energy-based model (see the method section). Low energy values correspond to the state-action pairs that the expert is most likely to perform.
  • Figure 2: Overview of ENP. At each step $t$, the agent acquires a series of observations and predicts the next step based on the instruction and navigation history. ENP optimizes the marginal state matching loss $\mathcal{L}_{\mathcal{S}}$ through SGLD sampling from the Marginal State Memory, and jointly minimizes the cross-entropy loss $\mathcal{L}_{\pi}$.
  • Figure 3: Qualitative results on R2R (Anderson et al., 2018; see the experiments section). (a) DUET (Chen et al., 2022) arrives in the wrong room instead of the 'recreation room' because the scene contains multiple rooms. Our agent reaches the goal successfully, demonstrating better decision-making ability. (b) Failure case: due to partial observability and occlusion in the environment, the 'kitchen' is hard to find from some positions, so our agent takes the wrong path and fails.
  • Figure 4: The average success rate of ENP and BC (Chen et al., 2022) across different trajectory lengths.
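
The Figure 2 caption mentions SGLD (stochastic gradient Langevin dynamics) sampling from a Marginal State Memory. The following is a minimal sketch of that general idea on a toy problem, not the paper's implementation: the quadratic energy, the 2-D "states", the replay-style buffer update, and every name and hyperparameter (`energy_grad`, `sgld_steps`, `refresh_memory`, `reinit_prob`, and so on) are assumptions made purely for illustration.

```python
import numpy as np

def energy_grad(states, mu):
    # Gradient of the toy energy E(s) = 0.5 * ||s - mu||^2 (an assumption).
    return states - mu

def sgld_steps(states, mu, rng, n_steps=20, step_size=0.1):
    # Langevin update: s <- s - (eps / 2) * grad E(s) + sqrt(eps) * noise.
    s = states.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=s.shape)
        s = s - 0.5 * step_size * energy_grad(s, mu) + np.sqrt(step_size) * noise
    return s

def refresh_memory(memory, mu, rng, n_rounds=100, batch=64, reinit_prob=0.05):
    # Hypothetical memory update: draw stored samples, occasionally restart
    # some from uniform noise, refine them with SGLD, and write them back.
    memory = memory.copy()
    for _ in range(n_rounds):
        idx = rng.choice(len(memory), size=batch, replace=False)
        init = memory[idx]
        fresh = rng.uniform(-1.0, 1.0, size=init.shape)
        use_fresh = rng.random(batch) < reinit_prob
        init = np.where(use_fresh[:, None], fresh, init)
        memory[idx] = sgld_steps(init, mu, rng)
    return memory

rng = np.random.default_rng(0)
mu = np.array([3.0, -2.0])                       # low-energy region of the toy model
memory = rng.uniform(-1.0, 1.0, size=(256, 2))   # memory starts as uniform noise
memory = refresh_memory(memory, mu, rng)
```

In this toy setting the stored samples drift from their noisy initialization toward the low-energy region around `mu`. In an actual energy-based policy the energy gradient would come from the network rather than a closed-form expression.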