Multi-Task Domain Adaptation for Language Grounding with 3D Objects

Penglei Sun, Yaoxian Song, Xinglin Pan, Peijie Dong, Xiaofei Yang, Qiang Wang, Zhixu Li, Tiefeng Li, Xiaowen Chu

TL;DR

The paper tackles visual language grounding for 3D objects under domain shift and data scarcity. It introduces DA4LG, a domain-adaptive, multi-task framework with a Domain-specific Encoder and a pseudo-Siamese visual branch to align vision-language representations across domains without extra data. Through VL-contrastive and captioning auxiliary tasks in addition to the primary Language Grounding task, DA4LG achieves state-of-the-art results on SNARE ($86.8\%$ multi-view, $83.8\%$ single-view) and demonstrates strong generalization in Simulation-SNARE. The approach reduces the domain gap and improves cross-modal grounding reliability with a compact model (~79.5M parameters), offering practical benefits for embodied agents operating across varied visual domains.

Abstract

Existing work on object-level language grounding with 3D objects mostly improves performance by using off-the-shelf pre-trained models to capture features, for example through viewpoint selection or geometric priors. However, it has not explored cross-modal representations for language-vision alignment across domains. To address this problem, we propose a novel method called Domain Adaptation for Language Grounding (DA4LG) with 3D objects. Specifically, DA4LG consists of a visual adapter module with multi-task learning that realizes vision-language alignment through comprehensive multimodal feature representation. Experimental results demonstrate that DA4LG performs competitively across visual and non-visual language descriptions, independent of the completeness of observation. DA4LG achieves state-of-the-art performance on the language grounding benchmark SNARE, with an accuracy of 83.8% in the single-view setting and 86.8% in the multi-view setting. The simulation experiments show that DA4LG is more practical and generalizes better than existing methods. Our project is available at https://sites.google.com/view/da4lg.
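
To make the multi-task idea concrete, the following is a minimal sketch of how a primary grounding loss could be combined with a vision-language contrastive loss and a captioning loss. It is not the paper's formulation: the symmetric InfoNCE form, the temperature, the loss weights `w_con` and `w_cap`, and all function names are assumptions chosen for illustration only.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(vision, language, temperature=0.07):
    """Symmetric InfoNCE over matched pairs (vision[i], language[i]).

    Assumed form: the section does not give the actual VL-contrastive loss.
    """
    v = [_normalize(x) for x in vision]
    t = [_normalize(x) for x in language]
    n = len(v)
    # Cosine-similarity logits between every vision and language embedding.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Vision-to-language direction: each row should pick its own column.
    v2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Language-to-vision direction: same thing on the transposed logits.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


def multi_task_loss(grounding, contrastive, captioning, w_con=1.0, w_cap=1.0):
    """Weighted sum of the primary loss and two auxiliary losses.

    w_con and w_cap are hypothetical weights, not values from the paper.
    """
    return grounding + w_con * contrastive + w_cap * captioning


# Toy check: matched pairs should give a lower contrastive loss than mismatched ones.
vision = [[1.0, 0.0], [0.0, 1.0]]
language = [[0.9, 0.1], [0.1, 0.9]]
aligned = contrastive_loss(vision, language)
shuffled = contrastive_loss(vision, language[::-1])
print(aligned < shuffled)
```

In this toy setup the contrastive term rewards embeddings whose matching vision and language vectors point in the same direction, which is the kind of alignment the auxiliary task is meant to encourage. How DA4LG actually balances the three objectives is not stated in this section.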

Paper Structure (20 sections, 8 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: The comparison between existing works and our model. Existing works focus on (a) multi-view perception and (b) external prior. (c) We approach language grounding from domain adaptation.
  • Figure 2: The framework of DA4LG. DA4LG comprises an Encoder Layer, an Embedding Reweighting Layer, and an Embedding Fusion Layer. The Encoder Layer contains three encoders: a Language Encoder (L. Encoder), a Vision Encoder, and a Domain-specific Encoder. The snowflake and fire icons denote frozen and unfrozen (trainable) components, respectively.
  • Figure 3: Visualization of examples. Original images of the objects are displayed on the left, attention score maps are visualized in the middle, and attention score maps enhanced by the domain adapter in the Domain-specific Encoder are shown on the right. Warmer colors, such as red, indicate higher attention scores, while cooler colors, such as blue, represent lower attention scores.
  • Figure 4: Examples illustrating instances where existing methods demonstrate success ($\color{red}\checkmark$) in the SNARE dataset but failure ($\color{blue}\times$) in the Simulation-SNARE dataset, in contrast to DA4LG, which maintains robust performance across both datasets ($\color{red}\checkmark$). We visualize the language description (left), Simulation-SNARE (middle), and SNARE (right). For the Simulation-SNARE examples, we showcase the front, bird, and side views. For the SNARE examples, we showcase all eight views.