Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention
Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan
TL;DR
This work introduces 3D Intention Grounding (3D-IG), a task that detects 3D objects in RGB-D scans guided by free-form human intention. It provides the Intent3D dataset with 44,990 intention texts tied to 209 object classes across 1,042 ScanNet scenes, generated via a GPT-4-based pipeline and cleaned for quality. The proposed IntentNet model combines multimodal feature extraction, verb–object reasoning, candidate box matching, and a cascaded adaptive learning scheme to jointly reason about intention and detection. Empirical results show IntentNet achieving state-of-the-art performance on Intent3D and demonstrate the value of explicit intention understanding for 3D grounding and downstream embodied perception.
Abstract
In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new task that detects 3D objects in RGB-D scans based on a human intention, such as "I want something to support my back". The closely related task of 3D visual grounding focuses on understanding human references. To detect an object from an intention, visual grounding relies on humans to observe the scene, reason out the target that aligns with their intention ("pillow" in this case), and finally provide a reference to the AI system, such as "A pillow on the couch". In contrast, 3D intention grounding challenges AI agents to automatically observe, reason, and detect the desired target based solely on human intention. To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset. We also establish several baselines on our benchmark using different language-based 3D object detection models. Finally, we propose IntentNet, our approach designed to tackle this intention-based detection problem. It focuses on three key aspects: intention understanding, reasoning to identify object candidates, and cascaded adaptive learning that leverages the intrinsic priority logic of different losses for multi-objective optimization. Project Page: https://weitaikang.github.io/Intent3D-webpage/
