GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models

Bin Wang, Ruotong Hu, Wenqian Wang, Wentong Li, Mingliang Gao, Runmin Cong, Wei Zhang

TL;DR

GA2-CLIP addresses semantic narrowing in video-language prompt tuning by introducing generic attribute anchors and externally supervised hard prompts. It couples frozen hard prompts with learnable soft prompts via nonlinear projections and uses an anchor-based regularization objective to improve generalization in base-to-novel and zero-shot settings. Experiments on HMDB-51, UCF-101, SSv2, and Kinetics-400 show substantial gains in cross-domain generalization with minimal computational overhead. The approach offers a practical, plug-and-play route to robust video prompt learning. The authors note limitations in explainability and in the sources of anchor data, and suggest exploring more video-specific anchors and multimodal LLM integration.

Abstract

Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to improve the generalization performance of VLMs in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a carefully designed collection of irrelevant videos and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
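The textual prompt coupling described in the abstract can be illustrated with a minimal sketch in plain Python. The abstract states only that hard prompt tokens are concatenated with soft prompt tokens and coupled through a learnable mapping layer; everything else here is an assumption made for illustration: the `tanh` nonlinearity, applying the mapping to the hard tokens before concatenation, the token counts, the embedding size, and the hypothetical names `make_tokens`, `mapping_layer`, and `couple_prompts`. Random vectors stand in for real token embeddings.

```python
import math
import random


def make_tokens(n_tokens, dim, rng):
    """Create n_tokens random vectors of size dim (toy stand-ins for embeddings)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_tokens)]


def mapping_layer(tokens, weight, bias):
    """Nonlinear projection tanh(W x + b), applied to each token independently.

    The tanh choice is an assumption; the source only mentions a learnable mapping.
    """
    return [
        [math.tanh(sum(w * x for w, x in zip(row, tok)) + b)
         for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def couple_prompts(hard_tokens, soft_tokens, weight, bias):
    """Project the frozen hard tokens, then concatenate them with the soft tokens."""
    return mapping_layer(hard_tokens, weight, bias) + soft_tokens


rng = random.Random(0)
DIM, N_HARD, N_SOFT = 8, 4, 4            # illustrative sizes, not from the paper

hard = make_tokens(N_HARD, DIM, rng)     # frozen: would come from prompts pre-trained on another dataset
soft = make_tokens(N_SOFT, DIM, rng)     # learnable: would be updated by the task loss
W = make_tokens(DIM, DIM, rng)           # learnable mapping weights (DIM x DIM)
b = [0.0] * DIM                          # learnable mapping bias

prompt = couple_prompts(hard, soft, W, b)
print(len(prompt), len(prompt[0]))       # sequence length and embedding size
```

In a full pipeline, the coupled sequence would be combined with the class-name tokens and passed to the frozen text encoder, with gradients flowing only to the soft tokens and the mapping parameters. The sketch omits the encoder, the training loop, and the anchor-based regularization.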

Paper Structure

This paper contains 14 sections, 10 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparative analysis of the video-text alignment process via learnable prompts. (a) Current video prompt fine-tuning methods destroy the information in the original semantic space, causing the model to lose the ability to discriminate between unknown categories. (b) GA2-CLIP mitigates video semantic bias toward known categories by introducing generic attribute anchors and generic attribute prompts.
  • Figure 2: Comparison of existing video prompt tuning architectures. (a) ViFi-CLIP inputs multiple learnable soft tokens combined with class tokens to the text encoder. (b) ViLT-CLIP guides soft-token learning by introducing hand-crafted prompt templates from vanilla CLIP, which keeps the soft tokens from overfitting to the base categories and losing their generalization ability. (c) Our GA2-CLIP introduces generic attribute anchors and hard prompts to guide the learning of soft tokens, and shows strong performance on both base and novel categories.
  • Figure 3: Architecture of the Generic Attribute Anchors CLIP (GA2-CLIP) method for multimodal prompt learning. The approach optimizes the model by adapting the vision and language branches, where only input prompts are learned while keeping the remainder of the model frozen.
  • Figure 4: Comparison of different hard and soft prompt token coupling methods.
  • Figure 5: The effect of different settings on base-to-novel generalization. (a) The effect of the number of anchor videos, sampled from 4 to 64, with 0 indicating that no anchors are used. (b) The effect of the fusion factor; here the weight of the vanilla factor is fixed to 1.0.
  • ...and 1 more figure