Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding

Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason

TL;DR

MAGiC grounds natural language in embodied 3D environments by exploiting contextual differences between visually similar objects and across multiple viewpoints. It introduces a transformer-based architecture that jointly reasons over two candidate objects and their views using CLIP image and language embeddings, with view and language masking applied during training as regularization. On the SNARE benchmark, MAGiC improves over the previous state-of-the-art model by 2.7% absolute accuracy (a 12.9% relative error reduction). Ablations show that both object-level context and multi-view context contribute to the improvement, and that adding explicit 3D features does not improve accuracy further.

Abstract

When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an object's appearance can vary with camera position. As such, we present the Multi-view Approach to Grounding in Context (MAGiC), which selects an object referent based on language that distinguishes between two similar objects. By pragmatically reasoning over both objects and across multiple views of those objects, MAGiC improves over the state-of-the-art model on the SNARE object reference task with a relative error reduction of 12.9% (representing an absolute improvement of 2.7%). Ablation studies show that reasoning jointly over object referent candidates and multiple views of each object both contribute to improved accuracy. Code: https://github.com/rcorona/magic_snare/
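
The two figures quoted in the abstract are linked by simple arithmetic: a relative error reduction is the absolute accuracy gain divided by the baseline error rate. The sketch below works backwards from the two reported numbers to the error rate they imply. The function name is hypothetical, and because both inputs are rounded, the outputs are approximations derived here, not values quoted from the paper.

```python
def implied_rates(absolute_gain_pts, relative_error_reduction):
    """Back out approximate baseline and improved accuracy (in percent).

    absolute_gain_pts: accuracy gain in percentage points (2.7 in the abstract).
    relative_error_reduction: fraction of baseline error removed (0.129 in the abstract).
    """
    # relative reduction = absolute gain / baseline error  =>  solve for baseline error
    baseline_error = absolute_gain_pts / relative_error_reduction
    baseline_accuracy = 100.0 - baseline_error
    improved_accuracy = baseline_accuracy + absolute_gain_pts
    return baseline_error, baseline_accuracy, improved_accuracy


baseline_error, baseline_accuracy, improved_accuracy = implied_rates(2.7, 0.129)
print(f"implied baseline error:    ~{baseline_error:.1f}%")
print(f"implied baseline accuracy: ~{baseline_accuracy:.1f}%")
print(f"implied improved accuracy: ~{improved_accuracy:.1f}%")
```

Under these rounded inputs, the baseline error comes out at roughly one fifth of examples, which is why a gain of under three points corresponds to a double-digit relative reduction.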

Paper Structure (23 sections, 1 equation, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Left: Previous methods for identifying object referents of language expressions in the SNARE benchmark consider target and distractor objects independently and pool multiple views before grounding. Right: By contrast, MAGiC jointly reasons over target and distractor objects and their views from different angles to identify the correct referent with higher accuracy than the previous state-of-the-art model.
  • Figure 2: Model Architecture. MAGiC consists of a multi-view transformer that attends to CLIP language embeddings for the description and CLIP image embeddings across multiple views for both objects. This transformer allows our model to contextually reason across views about both objects at the same time with respect to a language description. We do not use any positional encodings, and MAGiC is invariant to the input order of images and objects. Unlike previous methods for SNARE, we pool information from object views only after updating their representations with respect to the language referring expression. We apply view masking and language masking augmentations to regularize the model during training.
  • Figure 3: Explicit 3D Features. We find that adding 3D structural information to MAGiC does not improve accuracy on SNARE.
  • Figure 4: Impact of Fewer Views on Performance. We report validation-set results when fewer views of each object are available. We find that MAGiC outperforms MATCH, LAGOR, and VLG, achieving greater accuracy with fewer views.
  • Figure 5: View and language masking. We show the impact of different attention masking percentages for the view and language tokens that are input into MAGiC. Each variant is trained with 10 seeds. We find that 10% view masking and 20% language masking achieve the highest validation set accuracy.
  • ...and 2 more figures
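
The captions for Figures 2 and 5 describe three design choices: joint attention over language tokens and the views of both objects without positional encodings, pooling of views only after that attention step, and random masking of view and language tokens during training. The following is a minimal pure-Python sketch of those ideas, not the authors' implementation. It assumes a single attention layer with identity projections, independent per-token masking, and mean pooling; all function names are hypothetical, and the real model uses CLIP embeddings and learned weights.

```python
import math
import random


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, keep):
    """Single-head self-attention, no positional encodings, identity projections.

    keep[j] == False hides token j as a key/value (an attention mask).
    """
    dim = len(tokens[0])
    updated = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) if keep[j] else -math.inf
            for j, key in enumerate(tokens)
        ]
        weights = softmax(scores)
        updated.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return updated


def sample_keep_mask(n_lang, n_views_per_object, p_lang=0.2, p_view=0.1, rng=random):
    """Assumed training-time augmentation: hide each token independently.

    Default rates follow the best setting reported in the Figure 5 caption.
    At least one token per group is always kept so attention stays well defined.
    """
    def group(n, p):
        keep = [rng.random() >= p for _ in range(n)]
        if not any(keep):
            keep[rng.randrange(n)] = True
        return keep

    return group(n_lang, p_lang) + group(2 * n_views_per_object, p_view)


def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def contextualize_and_pool(lang, views_a, views_b, keep=None):
    """Attend jointly over language and both objects' views, then pool per object."""
    tokens = lang + views_a + views_b
    if keep is None:
        keep = [True] * len(tokens)  # no masking at inference time
    updated = self_attention(tokens, keep)
    n_lang, n_views = len(lang), len(views_a)
    pooled_a = mean_pool(updated[n_lang:n_lang + n_views])
    pooled_b = mean_pool(updated[n_lang + n_views:])
    return pooled_a, pooled_b
```

Because no positional information enters the attention step, reordering the views of an object leaves its pooled vector unchanged up to floating-point error, and swapping the two objects simply swaps the two pooled vectors. A scoring head that turns the pooled vectors into a choice between objects is omitted here because the captions do not describe one.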