
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan

TL;DR

This work introduces 3D Intention Grounding (3D-IG), a task that detects 3D objects in RGB-D scans guided by free-form human intention. It provides the Intent3D dataset with 44,990 intention texts tied to 209 object classes across 1,042 ScanNet scenes, generated via a GPT-4-based pipeline and cleaned for quality. The proposed IntentNet model combines multimodal feature extraction, verb–object reasoning, candidate box matching, and a cascaded adaptive learning scheme to jointly reason about intention and detection. Empirical results show IntentNet achieving state-of-the-art performance on Intent3D and demonstrate the value of explicit intention understanding for 3D grounding and downstream embodied perception.

Abstract

In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new task in 3D object detection from RGB-D scans based on human intention, such as "I want something to support my back". The closely related task of 3D visual grounding focuses on understanding human reference. To achieve detection based on human intention, 3D visual grounding relies on humans to observe the scene, reason out the target that aligns with their intention ("pillow" in this case), and finally provide a reference to the AI system, such as "A pillow on the couch". In contrast, 3D intention grounding challenges AI agents to automatically observe, reason, and detect the desired target based solely on human intention. To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset. We also establish several baselines on our benchmark, based on different language-based 3D object detection models. Finally, we propose IntentNet, an approach designed to tackle this intention-based detection problem. It focuses on three key aspects: intention understanding, reasoning to identify object candidates, and cascaded adaptive learning that leverages the intrinsic priority logic of different losses for multi-objective optimization. Project Page: https://weitaikang.github.io/Intent3D-webpage/

Paper Structure (35 sections, 3 equations, 17 figures, 5 tables)

Figures (17)

  • Figure 1: We introduce 3D intention grounding (right), a new task for detecting the object of interest using a 3D bounding box in a 3D scene, guided by human intention expressed in text (e.g., "I want something to support my back to relieve the pressure"). In contrast, existing 3D visual grounding (left) relies on human reasoning and references for detection. The illustration shows that observation and reasoning are performed manually by humans (left) and automated by AI (right).
  • Figure 2: (Upper row) Flowchart for dataset construction. After constructing the scene graph, we select objects by three criteria: Common Object, Non-trivial Object, Unambiguous Object. We use ChatGPT to generate intention texts from the prompt we designed. Finally, we manually clean the data. (Lower row) Examples from our dataset with different numbers of targets and different text lengths.
  • Figure 3: (a) The distribution of the text lengths of all intentions; (b) Word cloud of verbs used in all intention texts; (c) Word cloud of nouns used in all intention texts; (d) Number of different verbs used for each fine-grained class; (e) Number of different nouns used for each fine-grained class.
  • Figure 4: IntentNet: (Backbones) PointNet++ is used to extract point features, an MLP encodes the boxes predicted by a 3D object detector, and RoBERTa encodes the text input. (Encoder) Attention-based blocks are used for multimodal fusion, enhancing box features through integration with text features. (Decoder) Point features with top-k confidence are selected as proposed queries and then updated by attention-based blocks. Several MLPs are used to linearly project the queries for subsequent loss calculations. (Losses) The model learns to match the candidate boxes with target objects using $L_{bce}$. Queries are trained to identify verbs ($L_{vPos}$), align with verbs ($L_{vSem}$), and align with objects ($L_{voSem}$). Cascaded adaptive factors, which hierarchically weigh each loss based on its dependency on previous losses, are used for optimization. The colored boxes in the loss blocks represent different feature tokens.
  • Figure 5: Qualitative results of ablation studies. "Verb" indicates the alignment with verb tokens. "Verb2Obj" indicates the alignment with the object token in the sentence when the queries are modulated by the verb. "MatchBox" indicates Candidate Box Matching, and "Adapt" indicates Cascaded Adaptive Learning. Red boxes indicate the ground truth, blue boxes indicate the counterpart model's prediction, and green boxes indicate our model's prediction.
  • ...and 12 more figures
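
The Figure 4 caption describes cascaded adaptive factors that weigh each loss according to its dependency on the previous losses. The sketch below illustrates that general idea only. The gating rule `exp(-loss)`, the function names, and the loss ordering (taken from the order in which the caption lists $L_{bce}$, $L_{vPos}$, $L_{vSem}$, $L_{voSem}$) are all assumptions made for illustration, not the formulation used by IntentNet.

```python
import math


def cascaded_weights(losses):
    """Return one weight per loss, in cascade order.

    Hypothetical rule (an assumption, not the paper's formula): the weight
    of loss i is the product of exp(-loss_j) over all earlier losses j, so
    a later objective only gains influence once earlier ones are small.
    """
    weights = []
    gate = 1.0  # the first loss in the cascade is always fully weighted
    for value in losses:
        weights.append(gate)
        # exp(-value) is 1 when the loss is 0 and decays toward 0 as it grows
        gate *= math.exp(-value)
    return weights


def cascaded_total(losses):
    """Weighted sum of the losses using the hypothetical cascaded weights."""
    return sum(w * l for w, l in zip(cascaded_weights(losses), losses))


# Assumed order: [L_bce, L_vPos, L_vSem, L_voSem]; the values are made up.
example_losses = [2.0, 1.0, 1.0, 0.5]
example_weights = cascaded_weights(example_losses)
example_total = cascaded_total(example_losses)
```

With this rule, a large early loss (here the first entry) suppresses the weights of everything after it, which is one simple way to encode a priority order among objectives. In a real training loop the factors would typically be computed from detached loss values so that they act as weights rather than as additional gradient paths.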