DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding

Ting Liu, Xuyang Liu, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, Yue Hu

TL;DR

The paper tackles the high cost of fine-tuning large visual grounding models by introducing DARA, a parameter-efficient tuning framework built from Domain-aware Adapters and Relation-aware Adapters. DA Adapters refine intra-modality representations, while RA Adapters establish early cross-modal interactions to enhance spatial reasoning, allowing the backbone networks to remain largely frozen. On three VG benchmarks, DARA achieves the best accuracy among PETL methods and even surpasses full fine-tuning in some cases, while updating only a small fraction of backbone parameters. This approach demonstrates that targeted, domain- and relation-aware adapters can substantially reduce compute without sacrificing performance, advancing practical VG deployment and offering a blueprint for PETL in cross-modal tasks.

Abstract

Visual grounding (VG) is the challenging task of localizing an object in an image based on a textual description. The recent surge in the scale of VG models has substantially improved performance, but has also introduced a significant computational burden during fine-tuning. In this paper, we explore applying parameter-efficient transfer learning (PETL) to efficiently transfer pre-trained vision-language knowledge to VG. Specifically, we propose DARA, a novel PETL method comprising Domain-aware Adapters (DA Adapters) and Relation-aware Adapters (RA Adapters) for VG. DA Adapters first transfer intra-modality representations to be more fine-grained for the VG domain. RA Adapters then share weights to bridge the relation between the two modalities, improving spatial reasoning. Empirical results on widely used benchmarks demonstrate that DARA achieves the best accuracy while substantially reducing the number of updated parameters, compared with full fine-tuning and other PETL methods. Notably, with only 2.13% tunable backbone parameters, DARA improves average accuracy by 0.81% across the three benchmarks compared to the baseline model. Our code is available at https://github.com/liuting20/DARA.

Paper Structure (11 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Comparison of (a) full fine-tuning [deng2021transvg] and (b) our PETL method for visual grounding. (c) Freezing the pre-trained backbones and updating only our DARA reduces the number of updated backbone parameters by 97.87% while achieving even stronger performance than the full fine-tuning paradigm.
  • Figure 2: (a) Overview of our proposed parameter-efficient tuning framework for visual grounding. We freeze the vision and language backbones and update only our DARA, comprising the Domain-aware Adapters (DA Adapters) and Relation-aware Adapters (RA Adapters). (b) Detailed design of DARA. The DA Adapters transfer the rich pre-trained intra-modality representations, making them more fine-grained for the visual grounding domain. The RA Adapters then share their weights to bridge the relation between the two backbones and capture inter-modality representations.
  • Figure 3: Visualizations of cross-attention maps between the [REG] token and visual tokens under different strategies.
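
The weight sharing described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). The class name `BottleneckAdapter`, the down-project/ReLU/up-project residual form, the zero initialization of the up-projection, and the assumption that both modalities use the same token dimension are all hypothetical choices made for illustration only.

```python
import random


class BottleneckAdapter:
    """Hypothetical bottleneck adapter: x + W_up * relu(W_down * x)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection (bottleneck x dim) with a small random init.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection (dim x bottleneck), zero-initialized (an assumption)
        # so the adapter starts out as an identity mapping.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def num_params(self):
        # Count every weight in both projection matrices (no biases here).
        return (sum(len(row) for row in self.w_down)
                + sum(len(row) for row in self.w_up))

    def __call__(self, token):
        # Project down, apply ReLU, project back up, add the residual.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, token)))
                  for row in self.w_down]
        delta = [sum(w * h for w, h in zip(row, hidden)) for row in self.w_up]
        return [x + d for x, d in zip(token, delta)]


# One adapter instance serves both modalities, so its weights are shared.
shared_adapter = BottleneckAdapter(dim=8, bottleneck=2)
visual_token = [0.5] * 8    # stand-in for a frozen vision-backbone feature
text_token = [-0.25] * 8    # stand-in for a frozen language-backbone feature

visual_out = shared_adapter(visual_token)
text_out = shared_adapter(text_token)

# Sharing halves the adapter parameter count relative to two separate adapters.
shared_count = shared_adapter.num_params()
separate_count = 2 * BottleneckAdapter(dim=8, bottleneck=2).num_params()
```

In this sketch only the adapter weights would be trained while the backbone features are treated as fixed inputs, which mirrors the freeze-and-update split shown in Figures 1 and 2.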