Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding
Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason
TL;DR
MAGiC (Multi-view Approach to Grounding in Context) grounds natural language in embodied 3D environments by exploiting contextual differences between two visually similar candidate objects and across multiple viewpoints of each. It is a transformer-based architecture that reasons jointly over both candidates and their views using CLIP image and language embeddings, with attention masking to improve robustness. On the SNARE benchmark it improves accuracy over the prior state-of-the-art model by 2.7% absolute (a 12.9% relative error reduction). Ablations show that object-level context and multi-view context each contribute to the gain, and that explicit 3D features are not strictly necessary for it. Together, these results support designing 3D language grounding models around cross-object and multi-view context.
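As a rough illustration of the joint input described above, the sketch below lays out one language token followed by view tokens for two candidate objects, then draws a random mask over the views. It is not the authors' implementation: the function name, the choice of 8 views per object, the drop probability, and the masking scheme itself (randomly hiding views while keeping at least one per object) are assumptions made for illustration, and the CLIP embeddings and transformer layers are omitted entirely.

```python
import random

def build_joint_sequence(num_views=8, drop_prob=0.25, seed=0):
    """Lay out a joint token sequence for two candidate objects.

    Hypothetical helper: returns (tokens, hidden), where tokens[i] is a
    (kind, object_id, view_id) tuple and hidden[i] is True when that
    position should be hidden from attention.
    """
    rng = random.Random(seed)

    # One language token, then every view of object A, then every view of B.
    tokens = [("lang", None, None)]
    for obj in ("A", "B"):
        for view in range(num_views):
            tokens.append(("view", obj, view))

    # Assumed masking scheme: hide each view independently with drop_prob.
    # The language token is never hidden.
    hidden = [False] + [rng.random() < drop_prob for _ in tokens[1:]]

    # Keep at least one visible view per object so neither candidate vanishes.
    for obj in ("A", "B"):
        positions = [i for i, tok in enumerate(tokens) if tok[1] == obj]
        if all(hidden[i] for i in positions):
            hidden[positions[0]] = False

    return tokens, hidden

tokens, hidden = build_joint_sequence()
visible = [tok for tok, is_hidden in zip(tokens, hidden) if not is_hidden]
print(len(tokens), "tokens,", len(visible), "visible to attention")
```

In a PyTorch model, a boolean list of this shape would typically be converted to a tensor and passed as a key padding mask, where `True` marks positions that attention should ignore.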
Abstract
When connecting objects and their language referents in an embodied 3D environment, it is important to note that (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an object's appearance can vary with camera position. As such, we present the Multi-view Approach to Grounding in Context (MAGiC), which selects an object referent based on language that distinguishes between two similar objects. By pragmatically reasoning over both objects and across multiple views of those objects, MAGiC improves over the state-of-the-art model on the SNARE object reference task with a relative error reduction of 12.9% (representing an absolute accuracy improvement of 2.7%). Ablation studies show that reasoning jointly over object referent candidates and reasoning over multiple views of each object both contribute to improved accuracy. Code: https://github.com/rcorona/magic_snare/
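The relative and absolute figures quoted above together pin down the approximate error rates behind them. The short sketch below works through that arithmetic. The function name is hypothetical, and because both reported inputs are rounded, the outputs are approximations implied by the abstract rather than numbers it states.

```python
def implied_error_rates(relative_reduction, absolute_improvement):
    """Recover (baseline_error, new_error) in percentage points.

    relative_reduction = (baseline_error - new_error) / baseline_error
    absolute_improvement = baseline_error - new_error
    """
    baseline_error = absolute_improvement / relative_reduction
    new_error = baseline_error - absolute_improvement
    return baseline_error, new_error

# Figures from the abstract: 12.9% relative error reduction, 2.7% absolute gain.
baseline_error, magic_error = implied_error_rates(0.129, 2.7)
print(f"implied baseline error: {baseline_error:.1f}%")  # implied baseline error: 20.9%
print(f"implied MAGiC error: {magic_error:.1f}%")        # implied MAGiC error: 18.2%
```

Under these rounded inputs, the baseline would sit at roughly 79% accuracy and MAGiC at roughly 82%.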
