g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks

Zihan Wang, Gim Hee Lee

TL;DR

This work introduces Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model for embodied tasks that is pre-trained on a large-scale 3D-language dataset, and prepares a large-scale 3D-language dataset to align the feature-field representations with language.

Abstract

We introduce Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model for embodied tasks that is pre-trained on a large-scale 3D-language dataset. Our g3D-LF processes posed RGB-D images from agents to encode feature fields for: 1) predicting novel-view representations from any position in the 3D scene; 2) generating bird's-eye-view (BEV) maps centered on the agent; 3) querying targets with multi-granularity language within these representations. Our representation generalizes to unseen environments, enabling real-time construction and dynamic updates. By volume rendering latent features along sampled rays and integrating semantic and spatial relationships through multiscale encoders, our g3D-LF produces representations at different scales and perspectives, which are aligned with multi-granularity language via multi-level contrastive learning. Furthermore, we prepare a large-scale 3D-language dataset to align the representations of the feature fields with language. Extensive experiments on Vision-and-Language Navigation under both Panorama and Monocular settings, Zero-shot Object Navigation, and Situated Question Answering tasks highlight the significant advantages and effectiveness of our g3D-LF for embodied tasks.
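The phrase "volume rendering latent features along sampled rays" can be illustrated with the standard NeRF-style quadrature, applied to feature vectors instead of colors. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function name `render_ray_feature`, its arguments, and the plain-list data layout are all hypothetical, and the paper's actual sampling scheme and network outputs are not reproduced here.

```python
import math


def render_ray_feature(densities, features, deltas):
    """Composite per-sample latent features along one ray.

    Hypothetical helper (not from the paper). Uses the standard
    volume-rendering quadrature: weight_i = T_i * (1 - exp(-sigma_i * delta_i)),
    where T_i is the transmittance accumulated before sample i.

    densities: volume density (sigma) at each sample along the ray
    features:  latent feature vector (list of floats) at each sample
    deltas:    distance between consecutive samples
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, feat, delta in zip(densities, features, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for d in range(dim):
            rendered[d] += weight * feat[d]
        transmittance *= 1.0 - alpha  # attenuate for samples further along
    return rendered, weights


# Usage: an empty sample followed by a nearly opaque one.
rendered, weights = render_ray_feature(
    densities=[0.0, 50.0],
    features=[[1.0, 0.0], [0.0, 1.0]],
    deltas=[1.0, 1.0],
)
```

In this usage the first sample has zero density and contributes nothing, so the rendered feature is dominated by the second, nearly opaque sample. The weights always sum to at most 1.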

Paper Structure

This paper contains 19 sections, 7 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Our g3D-LF uses posed RGB-D images from the agent to predict novel view and BEV map representations at various scales within the 3D scene, aligned with multi-granularity language through 3D-language pre-training. The representation is applicable to embodied tasks like visual navigation and embodied question answering, facilitating scene representation, language-guided querying, and navigation planning.
  • Figure 2: Overview of our g3D-LF model. Our model encodes the observed RGB-D images into the feature fields (consisting of many feature points). By aggregating the k-nearest features, the MLP networks predict the latent feature and volume density of sampled points along the rendered ray. The hierarchical encoders then generate representations of the novel view, panorama, and BEV map, and conduct multi-level contrastive learning with multi-granularity language.
  • Figure 3: Monocular VLN framework based on VLN-3DFF [wang2024simtoreal].
  • Figure 4: Zero-shot object navigation framework based on VLFM [yokoyama2024vlfm].
  • Figure 5: The framework of situated question answering [masqa3d].
  • ...and 6 more figures
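
The Figure 2 caption mentions contrastive learning between scene representations and language. As a rough illustration of the general idea only, the sketch below computes an InfoNCE-style loss over paired scene and text embeddings. It is an assumption-laden example, not the paper's loss: the function names `cosine` and `info_nce`, the temperature value, and the use of cosine similarity are all hypothetical choices, and the multi-level structure described in the paper is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(scene_feats, text_feats, temperature=0.07):
    """InfoNCE-style loss where scene_feats[i] is paired with text_feats[i].

    Hypothetical sketch: every other text embedding in the batch acts as a
    negative for scene embedding i. The temperature is an assumed value.
    """
    n = len(scene_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(scene_feats[i], t) / temperature for t in text_feats]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]  # -log softmax of the true pair
    return total / n


# Usage: correctly paired embeddings versus deliberately swapped ones.
scene = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(scene, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(scene, [[0.0, 1.0], [1.0, 0.0]])
```

With correctly paired embeddings the loss is close to zero, while swapping the text embeddings makes it large, which is the signal that pulls matching scene and language representations together during training.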