MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding

Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, Didier Stricker

TL;DR

MiKASA tackles 3D visual grounding by integrating a scene-aware object encoder with a multi-key-anchor spatial reasoning scheme. It adopts an end-to-end trainable pipeline with a text encoder, vision module, spatial module, and a late fusion module that yields two interpretable outputs: a category score and a spatial score. The model outperforms prior methods on Referit3D datasets, especially in view-dependent scenarios, and provides improved explainability for error diagnosis. The contributions offer a scalable approach to grounding in cluttered 3D scenes and lay groundwork for task-specific spatial reasoning with anchors.
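The two interpretable outputs can be illustrated with a small sketch. The summary above does not state how the category score and the spatial score are combined, so the additive fusion followed by a softmax is an assumption made only for illustration, and the names `fuse_scores` and `explain` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(xs):
    """Index of the largest value (first index on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)


def fuse_scores(category_logits, spatial_logits):
    """Combine per-object scores into one distribution over objects.

    ASSUMPTION: additive fusion of the two logits; the actual
    combination rule is not given in this summary.
    """
    if len(category_logits) != len(spatial_logits):
        raise ValueError("both score lists must cover the same objects")
    fused = [c + s for c, s in zip(category_logits, spatial_logits)]
    return softmax(fused)


def explain(category_logits, spatial_logits):
    """Report which object each score alone would pick, and the fused pick."""
    return {
        "by_category": argmax(category_logits),
        "by_spatial": argmax(spatial_logits),
        "final": argmax(fuse_scores(category_logits, spatial_logits)),
    }


# Toy scene: objects 0 and 1 are both chairs, object 2 is a table.
# The category score cannot separate the two chairs; the spatial score can.
category = [2.0, 2.0, -1.0]
spatial = [0.5, 1.5, 1.0]
print(explain(category, spatial))
```

Keeping the two scores separate until the last step is what makes error diagnosis possible: a wrong final pick can be traced to either a category mistake or a spatial mistake.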

Abstract

3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with accuracy in object recognition and struggle in interpreting complex linguistic queries, particularly with descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets, particularly excelling by a large margin in categories that require viewpoint-dependent descriptions.

Paper Structure (23 sections, 9 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Our methodology uses a dual-prediction framework for 3D visual grounding. First, we assign a target category score based on object categorization (subfigure fig:category). Next, a spatial score is integrated according to the object's alignment with the textual description (subfigure fig:cate_sp).
  • Figure 2: Architecture of our 3D visual grounding model, which includes four main modules: a text encoder (BERT), a vision module with a scene-aware object encoder, a spatial module that fuses spatial and textual data, and a multi-layered fusion module. The fusion module combines text, spatial, and object features, employing a dual-scoring system for enhanced object category identification and spatial-language assessment.
  • Figure 3: Our novel spatial module captures relative spatial information from a single viewpoint by treating each object in the scene as a potential anchor. This approach generates unique spatial maps, each offering a different perspective of the scene. These maps then undergo feature augmentation, where distances and angles are calculated, followed by normalization and scaling. Subsequently, an MLP layer transforms these low-dimensional features into higher-dimensional ones for effective fusion with textual data.
  • Figure 4: Our novel attention-based spatial feature aggregation. Each map designates a different object as the target, while treating all other objects as anchors. Row $i$ of the score matrix represents the importance of each anchor relative to potential target object $i$, where $W_S$ and $W_F$ are learnable weight matrices.
  • Figure 5: Visual representation of the model's decision-making process in diverse situations. Rows, from top to bottom, depict: (1) choices determined by the category score, (2) choices determined by the spatial score, (3) our model's final selection after combining both scores, and (4) the ground truth. Columns, from left to right, show different scenarios. Green bounding boxes mark the chosen object, and red bounding boxes mark the unchosen distractors.
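
The spatial maps described in the Figure 3 caption can be sketched as follows: every object takes a turn as the anchor, and each other object is described by its offset, distance, and angle relative to that anchor. This is a minimal illustration, not the authors' implementation. The exact feature set, the use of a horizontal-plane angle, and normalization by the largest pairwise distance in the scene are assumptions, and the function name `spatial_maps` is hypothetical. The subsequent MLP projection is omitted.

```python
import math


def spatial_maps(centers):
    """Build one relative spatial map per candidate anchor.

    centers: list of (x, y, z) object centers in a shared scene frame.
    Returns maps where maps[i][j] holds the features of object j as seen
    from anchor i: (dx, dy, dz, distance, angle), all normalized.

    ASSUMPTIONS: features and normalization are illustrative choices.
    """
    raw = []
    max_dist = 0.0
    for ax, ay, az in centers:
        row = []
        for ox, oy, oz in centers:
            dx, dy, dz = ox - ax, oy - ay, oz - az
            dist = math.sqrt(dx * dx + dy * dy + dz * dz)
            # Angle in the horizontal (x, y) plane, in radians.
            angle = math.atan2(dy, dx)
            max_dist = max(max_dist, dist)
            row.append((dx, dy, dz, dist, angle))
        raw.append(row)

    # Scale lengths by the largest pairwise distance; angles by pi.
    scale = max_dist or 1.0
    return [
        [
            (dx / scale, dy / scale, dz / scale, dist / scale, angle / math.pi)
            for dx, dy, dz, dist, angle in row
        ]
        for row in raw
    ]


# Three objects on the floor plane; each one yields its own map.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
maps = spatial_maps(centers)
for i, row in enumerate(maps):
    print("anchor", i, [tuple(round(v, 3) for v in feat) for feat in row])
```

Because every object is tried as an anchor, a scene with N objects produces N maps of N entries each, so the cost grows quadratically with the number of objects.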