General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping

Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge

TL;DR

The paper tackles evaluation for instruction-conditioned navigation by replacing last-node-focused metrics with DTW-based path similarity. It introduces normalized DTW (nDTW) and a success-constrained variant (SDTW), showing superior alignment with human judgments and improved RL performance on VLN tasks. The methods support continuous and graph-based paths, with efficient computation, and SDTW proves especially effective for the more complex R4R setting. Overall, DTW-based measures provide a principled, scalable framework for evaluating and training navigation agents. The authors advocate widespread adoption of nDTW/SDTW for future VLN benchmarks.

Abstract

In instruction-conditioned navigation, agents interpret natural language and their surroundings to navigate through an environment. Datasets for studying this task typically contain pairs of instructions and reference trajectories. Yet most evaluation metrics used thus far fail to properly account for the reference trajectories, relying instead on insufficient similarity comparisons. We address fundamental flaws in previously used metrics and show how Dynamic Time Warping (DTW), a long-known method for measuring similarity between two time series, can be used to evaluate navigation agents. To this end, we define the normalized Dynamic Time Warping (nDTW) metric, which softly penalizes deviations from the reference path, is naturally sensitive to the order of the nodes composing each path, is suited to both continuous and graph-based evaluations, and can be computed efficiently. Further, we define SDTW, which constrains nDTW to only successful paths. We collect human similarity judgments for simulated paths and find that nDTW correlates better with human rankings than all other metrics. We also demonstrate that using nDTW as a reward signal for Reinforcement Learning navigation agents improves their performance on both the Room-to-Room (R2R) and Room-for-Room (R4R) datasets. The R4R results in particular highlight the superiority of SDTW over previous success-constrained metrics.
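
To make the metrics described above concrete, here is a minimal sketch in Python. It assumes the commonly cited form of the definitions, which this summary does not spell out: nDTW = exp(-DTW(R, Q) / (|R| · d_th)), and SDTW equal to nDTW when the query path ends within d_th of the reference goal and 0 otherwise. The function names, the `euclidean` helper, and the use of 2D points are illustrative assumptions, not the authors' code; for graph-based paths, `dist` would be a shortest-path distance between nodes.

```python
import math


def euclidean(a, b):
    """Illustrative distance for continuous paths (assumed 2D points)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def dtw(reference, query, dist=euclidean):
    """Classic dynamic-programming DTW cost between two point sequences."""
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost aligning reference[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(reference[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance reference only
                                 cost[i][j - 1],      # advance query only
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def ndtw(reference, query, threshold, dist=euclidean):
    """Normalized DTW in (0, 1]; assumes the exp(-DTW / (|R| * d_th)) form."""
    return math.exp(-dtw(reference, query, dist) / (len(reference) * threshold))


def sdtw(reference, query, threshold, dist=euclidean):
    """nDTW gated by success: the query must end within `threshold` of the goal."""
    success = dist(reference[-1], query[-1]) <= threshold
    return ndtw(reference, query, threshold, dist) if success else 0.0


if __name__ == "__main__":
    ref = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    detour = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]  # reaches the goal with a detour
    lost = [(0.0, 0.0), (5.0, 5.0)]                # ends far from the goal
    print(ndtw(ref, detour, threshold=3.0))
    print(sdtw(ref, lost, threshold=3.0))
```

The quadratic table is the simplest way to compute DTW and is adequate for short navigation paths; the sketch makes no attempt to reproduce any optimizations the paper may use.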

Paper Structure

This paper contains 9 sections, 3 equations, 3 figures, 3 tables, and 2 algorithms.

Figures (3)

  • Figure 1: Illustration of two pairs of reference ($R=r_{1..|R|}$) and query ($Q=q_{1..|Q|}$) series (solid), and the optimal warping between them (dashed) when computing DTW.
  • Figure 2: Example comparison set with one reference path (blue) and five query paths (orange).
  • Figure 3: Examples of random reference (blue) and query (orange) paths, sorted by nDTW values.