Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Sergio Arnaud, Paul McVay, Ada Martin, Arjun Majumdar, Krishna Murthy Jatavallabhula, Phillip Thomas, Ruslan Partsey, Daniel Dugas, Abha Gejji, Alexander Sax, Vincent-Pierre Berges, Mikael Henaff, Ayush Jain, Ang Cao, Ishita Prasad, Mrinal Kalakrishnan, Michael Rabbat, Nicolas Ballas, Mido Assran, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier

TL;DR

LOCATE 3D introduces 3D-JEPA, a self-supervised learning method for point clouds that contextualizes scene representations by predicting latent embeddings of masked regions. By lifting dense 2D foundation-model features (CLIP/DINO) into 3D and training a language-conditioned decoder, LOCATE 3D achieves state-of-the-art 3D referential grounding from sensor RGB-D streams without requiring mesh proposals. The LOCATE 3D Dataset (L3DD) provides extensive, diverse language annotations across multiple indoor scenes to study generalization and further improve performance (LOCATE 3D+). The approach demonstrates strong in-domain and out-of-domain generalization, with real-world robot deployment showing robust localization and end-to-end task success. These contributions collectively enable reliable, deployable 3D grounding for robotics and augmented reality applications.

Abstract

We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
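The "lifting" step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera model and a per-pixel feature map already computed by a 2D model; the function name `lift_features` and its arguments are hypothetical and not taken from the paper. How features are aggregated across frames is not described here and is left out.

```python
import numpy as np

def lift_features(depth, feats, K, cam_to_world):
    """Back-project pixels with valid depth to world space, keeping their 2D features.

    depth:        (H, W) metric depth, 0 where invalid (assumed convention)
    feats:        (H, W, C) per-pixel features from a 2D model
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera pose
    Returns (N, 3) world points and (N, C) features, one row per valid pixel.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]          # pixel row (v) and column (u) indices
    valid = depth > 0
    z = depth[valid]
    # Invert the pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # homogeneous coordinates
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, feats[valid]
```

Calling this once per posed RGB-D frame and concatenating the results would give a featurized point cloud of the kind the abstract describes as the input to 3D-JEPA.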

Paper Structure

This paper contains 45 sections, 3 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Overall Architecture of Locate 3D, which operates in three phases. In Phase 1: Preprocessing, we construct a point cloud with "lifted" features from 2D foundation models, which provide local information. In Phase 2: Contextualized Representations, these lifted features are passed through the pre-trained 3D-JEPA encoder, which provides a contextualized representation for the whole scene. Finally, in Phase 3: 3D Localization, a 3D decoder head uses the text query and 3D-JEPA features to localize the referred object.
  • Figure 2: 3D-JEPA training framework: The context encoder computes latent features from a masked point cloud. Subsequently, a predictor operates on these latent features to predict the features of masked regions. The target encoder has the same architecture as the context encoder, with weights given by the exponential moving average of the context encoder's weights over the course of training. The loss is computed per point in the embedding space and averaged across all points that were masked.
  • Figure 3: In our language-conditioned 3D mask and bounding box decoder, 3D-JEPA features are jointly processed with text and learned query embeddings by $n=8$ decoder blocks and specialized prediction heads that generate mask, token, and box predictions. N is the number of points in the input point cloud, T is the number of tokens in the input text, Q is the number of generated model queries, E is the decoder feature dimension, F is the input text feature dimension, and J is the 3D-JEPA feature dimension.
  • Figure 4: Overview of different masking types
  • Figure 5: Element-wise probe results during pre-training with different masking strategies
  • ...and 7 more figures
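
The two mechanics named in the Figure 2 caption, the exponential-moving-average target encoder and the per-point loss averaged over masked points, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the momentum value, and the choice of squared error as the per-point distance are all assumptions, since the caption does not specify them.

```python
import numpy as np

def ema_update(target_params, context_params, momentum=0.99):
    """Move each target-encoder parameter toward the context encoder's.

    The momentum value is a placeholder, not a figure from the paper.
    """
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]

def masked_latent_loss(predicted, target, mask):
    """Per-point loss in embedding space, averaged over masked points only.

    predicted, target: (N, D) embeddings from the predictor and target encoder
    mask:              (N,) boolean, True where the point was masked
    Squared error is assumed here; the caption does not name the distance.
    """
    per_point = ((predicted - target) ** 2).mean(axis=1)
    return per_point[mask].mean()
```

In a training loop, the context encoder and predictor would be updated by gradient descent on `masked_latent_loss`, while the target encoder would receive no gradients and change only through `ema_update`.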