
Volumetric Environment Representation for Vision-Language Navigation

Rui Liu, Wenguan Wang, Yi Yang

TL;DR

This work introduces a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells and achieves state-of-the-art performance across VLN benchmarks (R2R, REVERIE, and R4R).

Abstract

Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions. Comprehensive scene understanding is pivotal to successful navigation. Previous VLN agents employ monocular frameworks that extract 2D features directly from perspective views. Though straightforward, these frameworks struggle to capture 3D geometry and semantics, which leads to a partial and incomplete environment representation. To achieve a comprehensive 3D representation with fine-grained details, we introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells. For each cell, VER aggregates multi-view 2D features into a unified 3D space via 2D-3D sampling. Through coarse-to-fine feature extraction and multi-task learning for VER, our agent jointly predicts 3D occupancy, 3D room layout, and 3D bounding boxes. Based on VERs collected online, our agent performs volume state estimation and builds an episodic memory for predicting the next step. Experimental results show that the environment representations learned through multi-task learning yield clear performance gains on VLN. Our model achieves state-of-the-art performance across VLN benchmarks (R2R, REVERIE, and R4R).
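The 2D-3D sampling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes axis-aligned pinhole cameras looking along +z, and it swaps in nearest-pixel lookup with plain averaging for whatever learned sampling the model uses. The names `View`, `project`, and `aggregate_voxel_features` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


class View:
    """Hypothetical pinhole view looking along +z (no rotation, an assumption)."""

    def __init__(self, features: FeatureMap, focal: float, position: Vec3):
        self.features = features
        self.focal = focal        # focal length in pixels (assumed)
        self.position = position  # camera centre in world coordinates
        self.height = len(features)
        self.width = len(features[0])

    def project(self, point: Vec3) -> Optional[Tuple[int, int]]:
        """Return the (row, col) pixel a world point falls on, or None if unseen."""
        x = point[0] - self.position[0]
        y = point[1] - self.position[1]
        z = point[2] - self.position[2]
        if z <= 0:  # point lies behind the camera
            return None
        col = int(round(self.focal * x / z + (self.width - 1) / 2))
        row = int(round(self.focal * y / z + (self.height - 1) / 2))
        if 0 <= row < self.height and 0 <= col < self.width:
            return row, col
        return None


def aggregate_voxel_features(
    centers: Sequence[Vec3], views: Sequence[View], channels: int
) -> List[List[float]]:
    """Average, per voxel centre, the 2D features of every view that sees it."""
    volume = []
    for center in centers:
        total = [0.0] * channels
        hits = 0
        for view in views:
            pixel = view.project(center)
            if pixel is None:
                continue
            row, col = pixel
            feature = view.features[row][col]
            for c in range(channels):
                total[c] += feature[c]
            hits += 1
        # Voxels that no view observes keep an all-zero feature.
        volume.append([t / hits for t in total] if hits else total)
    return volume
```

Each voxel centre is projected into every view, so a cell seen from several viewpoints receives a fused feature, while a cell seen from none stays empty. A learned model would typically replace the plain average with weighted or attention-based aggregation.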

Paper Structure (18 sections, 14 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: The agent observes its surroundings with corresponding perspective features of different candidate views. Previous methods construct the topological graph or semantic map based on these 2D features. Our VER aggregates the multi-view features into structured 3D cells via 2D-3D sampling. VER is a powerful representation for both 3D perception tasks and VLN, providing a volume state space for decision-making.
  • Figure 2: Overview of our model. Given the perspective features of candidate views, a group of 3D queries is used to sample and aggregate them into VER (§\ref{sec_verencode}). To encode VER, we adopt coarse-to-fine extraction and perform multi-task learning on 3D perception. Based on VER, a volume state estimation module is proposed to predict state transition (§\ref{sec_stateestimate}). The episodic memory is used to store past observations using neighboring pillar representations for each viewpoint (§\ref{sec_action}). For decision-making, our agent combines both the local action probabilities from the volume state and the global action probabilities obtained from the episodic memory. See §\ref{sec_approach} for more details.
  • Figure 3: Our coarse-to-fine VER representation extraction (§\ref{sec_verencode}) adopts cascade up-sampling operations with 3D deconvolutions (Eq. \ref{eq_upsample}) and 3D queries (Eq. \ref{eq_sample}). The training process is supervised at different scales by multi-resolution semantic labels.
  • Figure 4: A representative visual result on val unseen of R2R (Anderson et al., 2018). We first visualize the 3D occupancy prediction at the key steps. In addition, we provide the prediction of 3D boxes and 3D room layout at step . We find that VER can capture the geometric details of 'couch' and the structure of 'bedroom'. With VER, our agent easily finds the 'bed' and succeeds. See §\ref{ex_vln} for more details.
  • Figure 5: Visualization of multi-resolution occupancy prediction (more details in §\ref{sec_verencode}).
  • ...and 2 more figures
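
The multi-resolution supervision mentioned in the captions of Figures 3 and 5 needs labels at several voxel scales. The following is a minimal sketch of one way such a label pyramid could be built for binary occupancy. It assumes a coarse cell counts as occupied whenever any fine cell inside it is occupied; this pooling rule and the names `downsample_occupancy` and `label_pyramid` are illustrative assumptions, not taken from the paper.

```python
from typing import List

Grid = List[List[List[int]]]  # occupancy[x][y][z], 1 = occupied, 0 = free


def downsample_occupancy(grid: Grid, factor: int = 2) -> Grid:
    """Pool a binary voxel grid: a coarse cell is occupied if any fine cell is."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    coarse = []
    for x in range(0, nx, factor):
        plane = []
        for y in range(0, ny, factor):
            row = []
            for z in range(0, nz, factor):
                occupied = any(
                    grid[i][j][k]
                    for i in range(x, min(x + factor, nx))
                    for j in range(y, min(y + factor, ny))
                    for k in range(z, min(z + factor, nz))
                )
                row.append(1 if occupied else 0)
            plane.append(row)
        coarse.append(plane)
    return coarse


def label_pyramid(grid: Grid, levels: int) -> List[Grid]:
    """Return labels from finest to coarsest, halving the resolution each level."""
    pyramid = [grid]
    for _ in range(levels - 1):
        pyramid.append(downsample_occupancy(pyramid[-1]))
    return pyramid
```

With a pyramid like this, each stage of a coarse-to-fine decoder can be compared against the label grid of matching resolution. Semantic (multi-class) labels would need a different pooling rule, such as a majority vote per coarse cell.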