MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, Didier Stricker
TL;DR
MiKASA tackles 3D visual grounding by combining a scene-aware object encoder with a multi-key-anchor spatial reasoning scheme. Its end-to-end trainable pipeline consists of a text encoder, a vision module, a spatial module, and a late fusion module that yields two interpretable outputs: a category score and a spatial score. The model outperforms prior methods on the Sr3D and Nr3D datasets of the Referit3D benchmark, especially on view-dependent descriptions, and its separate scores make errors easier to diagnose. These contributions offer a scalable approach to grounding in cluttered 3D scenes and lay the groundwork for anchor-based spatial reasoning.
Abstract
3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D space. Existing methods often suffer from inaccurate object recognition and struggle to interpret complex linguistic queries, particularly descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, improving both object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, which facilitates error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge on both the Sr3D and Nr3D datasets, and it excels by a large margin in categories that require viewpoint-dependent descriptions.
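
To make the idea of two interpretable outputs concrete, the following is a minimal sketch of how a per-object category score and spatial score could be fused and then inspected after a wrong prediction. It is not the authors' implementation: the summary does not state the fusion rule, so adding log-probabilities is an assumption made here, and the names late_fusion and diagnose as well as the example logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def late_fusion(category_logits, spatial_logits):
    """Fuse the two heads per candidate object.

    ASSUMPTION: fusion is the sum of log-probabilities; the fusion rule
    actually used by MiKASA is not given in the summary.
    """
    cat = log_softmax(category_logits)
    spa = log_softmax(spatial_logits)
    fused = [c + s for c, s in zip(cat, spa)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, cat, spa


def diagnose(pred, target, cat, spa):
    """Hypothetical helper: report which head preferred the wrong object."""
    if pred == target:
        return "correct"
    cat_wrong = cat[target] < cat[pred]
    spa_wrong = spa[target] < spa[pred]
    if cat_wrong and spa_wrong:
        return "both"
    return "category" if cat_wrong else "spatial"


# Made-up scene: objects 0 and 1 are chairs, object 2 is a table.
category_logits = [2.0, 2.0, -1.0]  # both chairs match the queried class
spatial_logits = [0.5, 1.5, 2.0]    # how well each object fits the relation
best, cat, spa = late_fusion(category_logits, spatial_logits)
print(best, diagnose(best, 0, cat, spa))
```

In this made-up scene the category head cannot separate the two chairs, so the spatial head decides between them. If the ground truth were object 0, the helper would attribute the mistake to the spatial score, which is the kind of error diagnosis that separate scores allow.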
