Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning

Eric Brouwer, Jan Erik van Woerden, Gertjan Burghouts, Matias Valdenegro-Toro, Marco Zullich

TL;DR

This work tackles the challenge of few-shot, fine-grained visual classification by augmenting CLIP with Adaptive Prompt Tuning (APT), a cross-attention mechanism that dynamically refines text prompts conditioned on the input image. Unlike static prompting methods, APT keeps the image and text encoders frozen while training a cross-attention layer that aligns textual representations with image patches, enhancing discriminability in high-variance datasets. The approach is further strengthened with Monte-Carlo Dropout to produce calibrated uncertainty estimates, enabling reliable predictions and meaningful confidence assessments. Evaluations on CUBirds, Oxford Flowers, and FGVC Aircraft show significant improvements over CoOp and VPT, especially in challenging fine-grained settings, with robust uncertainty analyses validating the trustworthiness of the predictions. Overall, the method advances state-of-the-art few-shot fine-grained classification and provides practical UQ insights for deployment in real-world settings.

Abstract

Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training (CLIP) model through adaptive prompt tuning, guided by real-time visual inputs. Unlike existing techniques such as Context Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by static prompts or visual token reliance, the proposed approach leverages a cross-attention mechanism to dynamically refine text prompts for the image at hand. This enables an image-specific alignment of textual features with image patches extracted from the Vision Transformer, making the model more effective for datasets with high intra-class variance and low inter-class differences. The method is evaluated on several datasets, including CUBirds, Oxford Flowers, and FGVC Aircraft, showing significant performance gains over static prompt tuning approaches. To ensure these performance gains translate into trustworthy predictions, we integrate Monte-Carlo Dropout in our approach to improve the reliability of the model predictions and uncertainty estimates. This integration provides valuable insights into the model's predictive confidence, helping to identify when predictions can be trusted and when additional verification is necessary. This dynamic approach offers a robust solution, advancing the state-of-the-art for few-shot fine-grained classification.
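The image-conditioned refinement of text features described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single attention head with identity projections in place of the trained multi-head layer, omits normalization, dropout, and the feed-forward layer, and runs on hypothetical toy vectors rather than CLIP embeddings. The function names (`cross_attend`, `classify`) are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_attend(text_feats, patch_feats):
    """Refine each class text feature using the image patches.

    Queries are the text features; keys and values are the patch features.
    Assumption: identity projections and one head (the real layer is trained).
    """
    d = len(text_feats[0])
    tuned = []
    for q in text_feats:
        # Scaled dot-product attention weights of this class over all patches.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in patch_feats])
        attended = [
            sum(w * p[j] for w, p in zip(weights, patch_feats)) for j in range(d)
        ]
        # Skip connection: add the attended visual context to the text feature.
        tuned.append([qi + ai for qi, ai in zip(q, attended)])
    return tuned


def classify(image_feat, tuned_text_feats):
    """Score each class by cosine similarity between image and tuned text features."""
    img = l2_normalize(image_feat)
    sims = [dot(img, l2_normalize(t)) for t in tuned_text_feats]
    return softmax(sims), sims


# Hypothetical toy inputs: 2 classes, 3 image patches, embedding size 4.
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
patch_feats = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.0, 0.2, 0.0], [0.1, 0.9, 0.0, 0.0]]
image_feat = [1.0, 0.2, 0.1, 0.0]

tuned = cross_attend(text_feats, patch_feats)
probs, sims = classify(image_feat, tuned)
print(probs)
```

Because the tuned text features depend on the patches of the current image, the class representations change per input, which is the contrast with static prompts drawn in the abstract.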

Paper Structure

This paper contains 32 sections, 8 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of the proposed APT method. The method leverages CLIP's image and text encoders (see Figure 2) to refine the text embeddings for the few-shot classification task. The main novelty introduced by APT is the cross-attention layer, illustrated within the dotted lines, which merges visual and text information through cross multi-head attention. The resulting output is passed through normalization, dropout, a feed-forward layer (which adds non-linearity), and skip connections to produce a set of tuned features that better fit the images at hand for the few-shot classification task. This layer is the only component trained in the few-shot problem; the weights of the image and text encoders are frozen in their pretrained state. The tuned features are then compared with the image features using cosine similarity to perform the few-shot classification.
  • Figure 2: Architecture of CLIP as used in the present work. CLIP employs two deep neural networks: an image encoder and a text encoder. The image encoder, a Vision Transformer (ViT), embeds the image into the $\mathbb{R}^d$ space; the text encoder embeds a natural language sentence in the same space. The image and text pair is taken from the COCO dataset (Lin et al., 2014).
  • Figure 3: Reliability plots support a qualitative evaluation of uncertainty. Points below the dotted diagonal line indicate overconfidence (confidence is higher than accuracy); points above the diagonal indicate underconfidence. For a calibrated model, the points lie roughly along the diagonal.
  • Figure 4: Results of the few-shot learning setup. Our approach (red) is compared to the baseline CLIP results (purple), CoOp (blue), and VPT (yellow). Results are averaged over 3 models, each trained on images sampled with a different seed.
  • Figure 5: Expected Calibration Error (ECE) across the number of training samples. A lower ECE indicates better calibration. The ECE decreases as the number of training samples increases.
  • ...and 4 more figures
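
The reliability plots and ECE curves in Figures 3 and 5 rest on the same binned comparison of confidence and accuracy. As a minimal sketch, the standard ECE groups predictions into confidence bins and takes the sample-weighted average of |accuracy − confidence| per bin. The number of bins (10) and the half-open binning convention used here are assumptions; the captions do not state the settings used in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction matched the label.
    n_bins:      number of equal-width confidence bins (assumed, not from the paper).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin b covers [b / n_bins, (b + 1) / n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        # Each bin contributes its confidence-accuracy gap, weighted by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical overconfident model: 90% confidence but only 1 of 4 predictions correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

In a reliability plot, each non-empty bin becomes one point at (average confidence, accuracy); the overconfident toy example above would sit well below the diagonal.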