Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation

Zelin Peng; Zhengqin Xu; Zhilin Zeng; Yaoming Wang; Wei Shen

TL;DR

Adapting CLIP for pixel-level prediction in open-vocabulary semantic segmentation runs into high compute demands, cross-modal misalignment, and poor generalization. The authors propose H-CLIP, a symmetric, parameter-efficient fine-tuning approach in hyperspherical space that uses block-diagonal, orthogonal transformations (POP) for the text encoder and a dual cross-relation communication (DCRC) module to align both CLIP modalities. By enforcing hyperspherical energy constraints on the text encoder and enabling cross-modal/cross-layer interactions through tensor-based operations, H-CLIP achieves state-of-the-art open-vocabulary segmentation while updating only about 4% of CLIP's parameters. This method offers a scalable, generalizable pathway to graft pixel-level capabilities onto large vision-language models with minimal retraining cost.

Abstract

Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers from three issues: 1) high computational cost, 2) misalignment between the two inherent modalities of CLIP, and 3) degraded generalization ability on unseen categories. To address these issues, we propose H-CLIP, a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both CLIP modalities. Specifically, the PEFT strategy is achieved by a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. Since the PEFT strategy is applied symmetrically to the two CLIP modalities, the misalignment between them is mitigated. Furthermore, we apply an additional constraint to PEFT on the CLIP text encoder according to the hyperspherical energy principle, i.e., minimizing hyperspherical energy during fine-tuning preserves the intrinsic structure of the original parameter space, to prevent the destruction of the generalization ability offered by the CLIP text encoder. Extensive evaluations across various benchmarks show that H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while requiring updates to only approximately 4% of the total parameters of CLIP.
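
The role of orthogonality in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the energy below is one common form of hyperspherical energy (the sum of inverse distances between unit-normalized neurons), and the 2x2 block size, the toy dimensions, and all function names are assumptions chosen for illustration. The sketch checks that a block-diagonal orthogonal transformation leaves this energy unchanged, whereas a non-orthogonal block-diagonal one generally does not, and it counts how many entries a block-diagonal matrix stores compared with a dense one.

```python
import math
import random


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hyperspherical_energy(neurons):
    """Sum of inverse distances between all ordered pairs of unit-normalized neurons."""
    unit = [normalize(w) for w in neurons]
    return sum(
        1.0 / math.dist(unit[i], unit[j])
        for i in range(len(unit))
        for j in range(len(unit))
        if i != j
    )


def apply_blocks(v, blocks):
    """Multiply v by a block-diagonal matrix given as a list of 2x2 blocks."""
    out = []
    for k, ((a, b), (c, d)) in enumerate(blocks):
        x, y = v[2 * k], v[2 * k + 1]
        out += [a * x + b * y, c * x + d * y]
    return out


def rotation(theta):
    """A 2x2 orthogonal block (plane rotation by theta)."""
    return ((math.cos(theta), -math.sin(theta)),
            (math.sin(theta), math.cos(theta)))


def block_diag_param_count(dim, block_size):
    """Entries stored by a block-diagonal matrix (a dense one stores dim * dim)."""
    return (dim // block_size) * block_size * block_size


# Toy setup (hypothetical sizes): 5 neurons, each an 8-dimensional weight vector.
rng = random.Random(0)
dim, num_neurons = 8, 5
neurons = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_neurons)]

orthogonal_blocks = [rotation(rng.uniform(0.0, math.pi)) for _ in range(dim // 2)]
shear_blocks = [((1.0, 3.0), (0.0, 1.0)) for _ in range(dim // 2)]  # not orthogonal

energy_before = hyperspherical_energy(neurons)
energy_orthogonal = hyperspherical_energy(
    [apply_blocks(w, orthogonal_blocks) for w in neurons]
)
energy_shear = hyperspherical_energy(
    [apply_blocks(w, shear_blocks) for w in neurons]
)

print("energy, original weights:        ", energy_before)
print("energy, orthogonal block-diag:   ", energy_orthogonal)
print("energy, non-orthogonal block-diag:", energy_shear)
print("stored entries:", block_diag_param_count(dim, 2), "vs dense", dim * dim)
```

Orthogonal maps preserve norms and pairwise angles, so the normalized neurons keep their relative positions on the hypersphere; this is the sense in which such a constraint can protect the structure learned during pre-training. The block-diagonal layout is what keeps the number of trainable entries small relative to a dense transformation.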

Paper Structure (17 sections, 21 equations, 2 figures, 5 tables)

Introduction
Related Work
Open-vocabulary Semantic Segmentation
Large-scale Model Fine-tuning
Preliminaries
Hyperspherical Energy
Notation of Tensor Product
Methodology
Overview of H-CLIP
Partial Orthogonal Parameterization
Dual Cross Relation Communication
Overall Architecture
Experiments
Experimental Setup
Main Results
...and 2 more sections

Figures (2)

Figure 1: A schematic representation of H-CLIP. In the H-CLIP framework, we propose a partial orthogonal fine-tuning strategy, where each pre-trained weight matrix is paired with a tuned block-diagonal transformation matrix, some of which are orthogonal to preserve generalization. Then, we introduce a dual cross-relation communication mechanism to facilitate communication among all matrices, enabling alignment between different modalities.
Figure 2: Comparison of qualitative results on ADE20K [ADE20K_IJCV_2019] with 150 categories. We compare H-CLIP with a state-of-the-art method, i.e., CAT-Seg [catseg_cvpr_2024].
