Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation

Jiaqi Huang; Zunnan Xu; Ting Liu; Yong Liu; Haonan Han; Kehong Yuan; Xiu Li

TL;DR

The paper tackles Referring Image Segmentation (RIS) under a parameter-efficient tuning paradigm, addressing the problem that large vision-language backbones often have misaligned visual and text encoders when repurposed for RIS. It introduces DETRIS, which freezes the pre-trained backbone and injects Dense Aligner modules to enable dense, multi-scale cross-modal feature propagation, together with Text Adapters that improve linguistic representations. A cross-modal neck and a vision-language decoder fuse visual and textual signals, and the model is trained with a text-to-visual contrastive loss. DETRIS achieves state-of-the-art results with only 0.9%–1.8% backbone parameter updates. On RIS benchmarks such as RefCOCO, RefCOCO+, and G-Ref, DETRIS-L attains 72.2 IoU and DETRIS-B 70.4 IoU, and training on mixed datasets further raises performance to 77.2 IoU, indicating strong efficiency and robustness for dense multi-modal tasks.
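As a rough illustration of what a text-to-visual contrastive loss can look like, the sketch below scores each pixel embedding against one sentence embedding and applies binary cross-entropy against the ground-truth mask. This summary does not state the exact formulation used by DETRIS, so the sigmoid-of-dot-product form, the function name `text_to_visual_contrastive_loss`, and the toy vectors are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def text_to_visual_contrastive_loss(text_vec, pixel_vecs, mask):
    """Mean binary cross-entropy between sigmoid(text . pixel) and the mask.

    Hypothetical helper, not taken from the DETRIS code base.
    text_vec:   list[float], one sentence-level embedding of length D
    pixel_vecs: list[list[float]], one embedding of length D per pixel
    mask:       list[int], 1 where the pixel belongs to the referred object
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for vec, label in zip(pixel_vecs, mask):
        score = sum(t * v for t, v in zip(text_vec, vec))
        p = sigmoid(score)
        # Pull pixels inside the mask toward the text, push the rest away.
        total += -math.log(p + eps) if label == 1 else -math.log(1.0 - p + eps)
    return total / len(mask)


# Toy example: two pixels, the first one belongs to the referred object.
text = [1.0, 0.0]
mask = [1, 0]
aligned = [[4.0, 0.0], [-4.0, 0.0]]      # agrees with the mask
misaligned = [[-4.0, 0.0], [4.0, 0.0]]   # contradicts the mask

loss_good = text_to_visual_contrastive_loss(text, aligned, mask)
loss_bad = text_to_visual_contrastive_loss(text, misaligned, mask)
print(loss_good < loss_bad)
```

The loss is small when pixels inside the mask align with the text embedding and pixels outside it do not, which is the behaviour a segmentation head needs in order to threshold the score map into a mask.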

Abstract

In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it reduces transfer learning costs and optimizes hardware utilization. However, current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they remain limited to aligned encoders (e.g., CLIP) and do not explore misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also propose text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods on challenging benchmarks with 0.9% to 1.8% backbone parameter updates. Our project is available at https://github.com/jiaqihuang01/DETRIS.
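The idea of dense interconnections between each layer and all preceding layers can be sketched with a toy frozen backbone and one low-rank adapter per layer. This is a minimal sketch under stated assumptions, not the authors' Dense Aligner: earlier low-rank features are aggregated by a plain sum, the text prior and multi-scale handling are omitted, and the names `LowRankAdapter` and `run_backbone` are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LowRankAdapter:
    """Down-project to `rank`, mix in earlier low-rank features, up-project."""

    def __init__(self, dim, rank, rng):
        self.down = rand_matrix(rank, dim, rng)  # trainable, dim -> rank
        self.up = rand_matrix(dim, rank, rng)    # trainable, rank -> dim

    def forward(self, x, earlier_low_rank):
        z = matvec(self.down, x)
        # Dense connection: add the low-rank features of ALL preceding layers.
        # Summation is an assumption made for this sketch.
        for prev in earlier_low_rank:
            z = [a + b for a, b in zip(z, prev)]
        z = [max(0.0, v) for v in z]  # ReLU
        delta = matvec(self.up, z)
        # Residual update of the frozen layer's output.
        return [xi + di for xi, di in zip(x, delta)], z


def run_backbone(x, frozen_layers, adapters):
    """Alternate frozen layers with adapters that see every earlier adapter."""
    history = []
    for W, adapter in zip(frozen_layers, adapters):
        x = matvec(W, x)                     # frozen backbone layer (never updated)
        x, z = adapter.forward(x, history)   # adapter reads all earlier low-rank features
        history.append(z)
    return x, history


rng = random.Random(0)
dim, rank, num_layers = 8, 2, 3
frozen_layers = [rand_matrix(dim, dim, rng, scale=0.5) for _ in range(num_layers)]
adapters = [LowRankAdapter(dim, rank, rng) for _ in range(num_layers)]
x0 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

out, history = run_backbone(x0, frozen_layers, adapters)
print(len(out), len(history))
```

Because only the `down` and `up` matrices would receive gradients, the number of updated parameters scales with the rank rather than with the backbone width, which is what keeps the update fraction small.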

Paper Structure (19 sections, 7 equations, 4 figures, 4 tables)

This paper contains 19 sections, 7 equations, 4 figures, 4 tables.

Introduction
Related Work
Methodology
Framework Overview
Image & Text Feature Extraction
Local & Global Feature Interaction
The Referring Image Segmentation Head
Training Objective
Experiments
Datasets
Implementation Details
Main Results
Qualitative Analysis
Ablation Study
Conclusion
...and 4 more sections

Figures (4)

Figure 1: Overall framework of our DETRIS. In the image branch, we utilize Dense Aligner (DA) to facilitate cross-modal and multi-scale modeling of low-rank visual features. This approach incorporates textual global prior information to enhance the visual features $f_v$. In the text branch, we also use "D-MoC" as our Text Adapters (TA) to obtain the text feature $f_t$.
Figure 2: Qualitative results: (a) the input image; (b) the ground truth; (c) ETRIS; (d) DETRIS-B without Dense Aligner; (e) DETRIS-B without Text Adapter; (f) our proposed DETRIS-B; (g) DETRIS-L using mixed datasets.
Figure 3: Ablation study of the Adapter’s rank and comparison with other Parameter-Efficient Tuning Methods.
Figure 4: Comparison between DETRIS and state-of-the-art PET RIS method ETRIS.
