AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning

Zheng Li, Yibing Song, Xin Zhang, Lei Luo, Xiang Li, Jian Yang

TL;DR

AnchorOPT tackles the rigidity of fixed anchors in CLIP prompt learning by introducing dynamic anchors $t_{anc}$ and a learnable position matrix $W$ that adapt prompts to the task context. Training proceeds in two stages: Stage I optimizes $t_{anc}$ by aligning them with LLM-generated descriptions $t_d$, and Stage II freezes the anchors and jointly optimizes the soft tokens and $W$ (including a deep variant that preserves anchors across layers). Experiments on 11 datasets show base-to-novel and cross-dataset generalization gains when AnchorOPT is integrated with strong baselines, often exceeding methods that add extra modules or regularization. The approach is plug-and-play, demonstrating that simple, dynamic prompt structures can yield strong generalization for CLIP.

Abstract

Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP's generalization. However, these anchors, static in both value and position, lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.
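
The two-stage schedule described in the abstract can be sketched as a toy NumPy example. This is an illustration under stated assumptions, not the authors' implementation: the cosine-alignment objective in `stage_one`, the function and parameter names, and the placeholder gradients passed to `stage_two` are all hypothetical. The abstract itself only states that anchors are learned first, then frozen while the soft tokens and the position matrix are optimized.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def stage_one(t_anc, t_d, lr=0.1, steps=200):
    """Toy Stage I (assumed objective): move the anchor toward a
    description embedding by gradient ascent on cosine similarity."""
    a = t_anc.copy()
    nb = np.linalg.norm(t_d)
    for _ in range(steps):
        na = np.linalg.norm(a)
        cos = a @ t_d / (na * nb)
        # Analytic gradient of cos(a, t_d) with respect to a.
        grad = t_d / (na * nb) - cos * a / na**2
        a += lr * grad
    return a

def stage_two(params, grads, lr=0.1):
    """Toy Stage II update: anchors are frozen; only the soft tokens
    and the position-matrix parameters receive a gradient step."""
    updated = dict(params)
    for name in ("soft_tokens", "position_logits"):
        updated[name] = params[name] - lr * grads[name]
    return updated

rng = np.random.default_rng(0)
dim = 8
t_anc_init = rng.normal(size=dim)  # randomly initialised anchor (toy)
t_d = rng.normal(size=dim)         # stand-in for a description embedding

# Stage I: learn the anchor.
learned_anchor = stage_one(t_anc_init, t_d)

# Stage II: freeze the anchor, update everything else.
params = {
    "anchors": learned_anchor,
    "soft_tokens": rng.normal(size=(4, dim)),
    "position_logits": rng.normal(size=(6, 6)),
}
# Placeholder gradients; a real run would obtain these from the task loss.
grads = {
    "soft_tokens": rng.normal(size=(4, dim)),
    "position_logits": rng.normal(size=(6, 6)),
}
after = stage_two(params, grads)
```

The design point the sketch is meant to surface is the split in trainable parameters: `anchors` is never touched in `stage_two`, which mirrors the "freeze and transfer" step in the abstract.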

Paper Structure

This paper contains 30 sections, 12 equations, 10 figures, 13 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison of fundamental input prompt structures. (a) CLIP [radford2021learning] employs manually designed text templates. (b) CoOp [zhou2022learning] introduced prompt learning for CLIP adaptation using learnable soft tokens concatenated with fixed category tokens. (c) ATPrompt [li2025advancing] incorporates explicit, fixed attribute tokens (e.g., "color", "shape") to guide soft token learning via an attribute-based template. (d) AnchorOPT utilizes implicit anchors learned from data to guide the learning of soft tokens and proposes a learnable position matrix that dynamically adjusts the prompt sequence according to downstream requirements.
  • Figure 2: The dynamic anchor token training process comprises two stages: (a) Anchor Optimization: Anchor tokens are initialized as learnable parameters and optimized using LLM-generated category descriptions. The resulting anchors are frozen for the subsequent stage. (b) Adaptation: Soft prompts and the position matrix are jointly optimized for downstream tasks, with knowledge distillation from ensemble results providing auxiliary supervision.
  • Figure 3: Computational process in deep prompt learning variants: (a) MaPLe drops and reintroduces all soft tokens after each Transformer block. (b) ATPrompt retains all attribute-related hard/soft tokens while discarding class-related soft tokens. (c) AnchorOPT dynamically reorders tokens via the position matrix, retaining only anchor tokens and discarding all soft tokens during processing.
  • Figure 4: Position matrix visualizations across five datasets. For the Oxford Pets dataset, a value of 1 at row 1, column 4 indicates that token 4 in the original sequence is mapped to position 1 in the transformed sequence. Each dataset exhibits distinct convergence patterns. Note that all-zero values in the final column of the position matrix do not imply anchor tokens are redundant; though omitted from the visualization, these tokens contribute to intermediate computation stages during training.
  • Figure S1: Illustration of one-stage training paradigm. The framework alternates between (i) optimizing anchor tokens and (ii) updating soft tokens and the position matrix while freezing anchors, iterating until convergence.
  • ...and 5 more figures
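
The reading convention in the Figure 4 caption (a 1 at row i, column j means source token j lands at position i) can be illustrated with a small NumPy sketch. The token values, the example matrices, and the helper names `apply_position_matrix` and `read_mapping` are hypothetical; how the paper parameterizes and trains the position matrix is not described in this section, so the sketch only covers a hard 0/1 matrix after convergence.

```python
import numpy as np

def apply_position_matrix(W, tokens):
    """Reorder a token sequence: row i of W picks the source token
    that ends up at output position i."""
    return W @ tokens

def read_mapping(W):
    """List (output position, source token) pairs, 1-based as in the caption."""
    rows, cols = np.nonzero(W)
    return [(int(r) + 1, int(c) + 1) for r, c in zip(rows, cols)]

# Four toy tokens with 2-dimensional embeddings.
tokens = np.array([[10.0, 10.0],
                   [20.0, 20.0],
                   [30.0, 30.0],
                   [40.0, 40.0]])

# A 1 at row 1, column 4 sends source token 4 to output position 1.
W = np.array([[0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)
reordered = apply_position_matrix(W, tokens)
mapping = read_mapping(W)

# An all-zero final column: source token 4 does not appear in the output.
W_drop = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 1, 0]], dtype=float)
shortened = apply_position_matrix(W_drop, tokens)
```

Here `reordered` starts with the embedding of source token 4, and `shortened` has three rows because no row of `W_drop` selects the fourth token.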