DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding

Ting Liu; Xuyang Liu; Siteng Huang; Honggang Chen; Quanjun Yin; Long Qin; Donglin Wang; Yue Hu

TL;DR

The paper tackles the high cost of fine-tuning large visual grounding models by introducing DARA, a parameter-efficient tuning framework built from Domain-aware Adapters and Relation-aware Adapters. DA Adapters refine intra-modality representations, while RA Adapters establish early cross-modal interactions to enhance spatial reasoning, allowing the backbone networks to remain largely frozen. On three VG benchmarks, DARA achieves the best accuracy among PETL methods and even surpasses full fine-tuning in some cases, while updating only a small fraction of backbone parameters. This approach demonstrates that targeted, domain- and relation-aware adapters can substantially reduce compute without sacrificing performance, advancing practical VG deployment and offering a blueprint for PETL in cross-modal tasks.

Abstract

Visual grounding (VG) is the challenging task of localizing an object in an image based on a textual description. The recent surge in the scale of VG models has substantially improved performance, but it has also made fine-tuning considerably more computationally expensive. In this paper, we explore applying parameter-efficient transfer learning (PETL) to efficiently transfer pre-trained vision-language knowledge to VG. Specifically, we propose DARA, a novel PETL method comprising Domain-aware Adapters (DA Adapters) and Relation-aware Adapters (RA Adapters) for VG. DA Adapters first transfer intra-modality representations to be more fine-grained for the VG domain. RA Adapters then share weights to bridge the relation between the two modalities, improving spatial reasoning. Empirical results on widely used benchmarks demonstrate that DARA achieves the best accuracy while updating far fewer parameters than full fine-tuning and other PETL methods. Notably, with only 2.13% tunable backbone parameters, DARA improves average accuracy by 0.81% across the three benchmarks compared to the baseline model. Our code is available at https://github.com/liuting20/DARA.
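The parameter arithmetic behind the abstract's efficiency claim can be illustrated with a small sketch. It assumes a generic bottleneck adapter (a down-projection followed by an up-projection), which is a common PETL building block and not necessarily DARA's exact design. The hidden width, bottleneck width, layer count, and backbone size below are hypothetical placeholders, not values from the paper. Only the 2.13% figure comes from the abstract.

```python
def bottleneck_adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Count weights and biases of a down-projection plus up-projection pair."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up


def tunable_fraction(backbone_params: int, adapter_params: int) -> float:
    """Tunable adapter parameters relative to the frozen backbone size."""
    return adapter_params / backbone_params


def reduction_vs_full_finetuning(fraction: float) -> float:
    """Share of backbone updates avoided compared with updating everything."""
    return 1.0 - fraction


# Hypothetical sizes, chosen only to make the arithmetic concrete.
HIDDEN_DIM = 768
BOTTLENECK_DIM = 64
NUM_LAYERS = 12
BACKBONE_PARAMS = 200_000_000

per_adapter = bottleneck_adapter_params(HIDDEN_DIM, BOTTLENECK_DIM)

# Separate adapters: one per layer in each of the two backbones.
separate = 2 * NUM_LAYERS * per_adapter

# Shared adapters: one set of weights per layer, reused by both modalities.
# This toy assumes both backbones have the same hidden width.
shared = NUM_LAYERS * per_adapter

print(f"separate adapters: {tunable_fraction(BACKBONE_PARAMS, separate):.4%}")
print(f"shared adapters:   {tunable_fraction(BACKBONE_PARAMS, shared):.4%}")

# The abstract's figure: 2.13% tunable backbone parameters.
print(f"reduction at 2.13%: {reduction_vs_full_finetuning(0.0213):.2%}")
```

In this toy setting, sharing one set of adapter weights across the two modalities halves the adapter parameter count. The last line shows that 2.13% tunable parameters is the same statement as a 97.87% reduction in updated backbone parameters.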

Paper Structure (11 sections, 5 equations, 3 figures, 4 tables)

Introduction
Related Work
Methodology
Baseline Model
Domain-aware and Relation-aware Adapters
Experiments
Experimental Settings
Main Results
Ablation Study and Analysis
Conclusion
Acknowledgments

Figures (3)

Figure 1: Comparison of (a) full fine-tuning [deng2021transvg] and (b) our PETL method for visual grounding. (c) Freezing the pre-trained backbones and updating our DARA reduces the number of updated backbone parameters by 97.87% while achieving even stronger performance than the full fine-tuning paradigm.
Figure 2: (a) Overview of our proposed parameter-efficient tuning framework for visual grounding. We freeze the vision and language backbones and update our DARA, which comprises the Domain-aware Adapters (DA Adapters) and Relation-aware Adapters (RA Adapters). (b) Detailed design of DARA. The DA Adapters transfer the rich pre-trained intra-modality representations, making them more fine-grained for the visual grounding domain. The RA Adapters then share weights to bridge the relation between the two backbones and capture inter-modality representations.
Figure 3: Visualizations of cross-attention maps between the [REG] token and visual tokens under different strategies.
