Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Sergio Arnaud, Paul McVay, Ada Martin, Arjun Majumdar, Krishna Murthy Jatavallabhula, Phillip Thomas, Ruslan Partsey, Daniel Dugas, Abha Gejji, Alexander Sax, Vincent-Pierre Berges, Mikael Henaff, Ayush Jain, Ang Cao, Ishita Prasad, Mrinal Kalakrishnan, Michael Rabbat, Nicolas Ballas, Mido Assran, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier
TL;DR
LOCATE 3D introduces 3D-JEPA, a self-supervised learning method for point clouds that contextualizes scene representations by predicting latent embeddings of masked regions. By lifting dense 2D foundation-model features (CLIP/DINO) into 3D and training a language-conditioned decoder, LOCATE 3D achieves state-of-the-art 3D referential grounding from sensor RGB-D streams without requiring mesh proposals. The LOCATE 3D Dataset (L3DD) provides extensive, diverse language annotations across multiple indoor scenes to study generalization and further improve performance (LOCATE 3D+). The approach demonstrates strong in-domain and out-of-domain generalization, with real-world robot deployment showing robust localization and end-to-end task success. These contributions collectively enable reliable, deployable 3D grounding for robotics and augmented reality applications.
Abstract
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Masked prediction in latent space then serves as the pretext task for learning contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. The dataset enables a systematic study of generalization capabilities and yields a stronger model.
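To make the input stage concrete, the following is a minimal sketch of how per-pixel 2D features from a posed RGB-D frame could be lifted into a featurized 3D point cloud. It is an illustration under stated assumptions, not the authors' implementation: the function name `lift_features_to_3d`, the pinhole camera model, and the array layouts are all hypothetical, and the feature map is a random stand-in for real CLIP or DINO outputs.

```python
import numpy as np


def lift_features_to_3d(depth, feats, K, cam_to_world):
    """Unproject per-pixel features into a world-frame point cloud.

    Hypothetical helper, assuming a pinhole camera model.
      depth:        (H, W) depth map, 0 where the sensor has no reading
      feats:        (H, W, C) dense 2D features (stand-in for CLIP/DINO)
      K:            (3, 3) camera intrinsics
      cam_to_world: (4, 4) camera pose
    Returns (N, 3) world-frame points and their (N, C) features.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]          # pixel row (v) and column (u) indices
    valid = depth > 0                  # drop pixels without a depth reading

    # Back-project valid pixels into the camera frame.
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # (N, 4) homogeneous

    # Move the points into the world frame using the camera pose.
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, feats[valid]


# Toy usage: a 2x2 frame with one missing depth pixel.
depth = np.array([[1.0, 2.0], [0.0, 1.0]])
feats = np.random.default_rng(0).normal(size=(2, 2, 4))
K = np.eye(3)                          # unit focal length, principal point at origin
pose = np.eye(4)
pose[0, 3] = 10.0                      # camera shifted 10 units along world x
points, point_feats = lift_features_to_3d(depth, feats, K, pose)
```

Points from many frames would be concatenated into one scene-level cloud. How the cloud is subsampled or pooled before it reaches the 3D-JEPA encoder is not specified in this section, so that step is left out here.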
