Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning

Eric Brouwer; Jan Erik van Woerden; Gertjan Burghouts; Matias Valdenegro-Toro; Marco Zullich

TL;DR

This work tackles few-shot, fine-grained visual classification by augmenting CLIP with Adaptive Prompt Tuning (APT), a cross-attention mechanism that refines the text prompts for each input image. Unlike static prompting methods, APT keeps the image and text encoders frozen while training a cross-attention layer that aligns textual representations with image patches, which improves discriminability on datasets with high intra-class variance. Monte-Carlo Dropout is added to produce calibrated uncertainty estimates, so that a prediction comes with a meaningful measure of confidence. Evaluations on CUBirds, Oxford Flowers, and FGVC Aircraft show significant improvements over CoOp and VPT, especially in the harder fine-grained settings, and the accompanying uncertainty analyses support the trustworthiness of the predictions. Overall, the method advances the state of the art in few-shot fine-grained classification and offers practical uncertainty quantification (UQ) insights for real-world deployment.

Abstract

Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training (CLIP) model through adaptive prompt tuning, guided by real-time visual inputs. Unlike existing techniques such as Context Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by static prompts or visual token reliance, the proposed approach leverages a cross-attention mechanism to dynamically refine text prompts for the image at hand. This enables an image-specific alignment of textual features with image patches extracted from the Vision Transformer, making the model more effective for datasets with high intra-class variance and low inter-class differences. The method is evaluated on several datasets, including CUBirds, Oxford Flowers, and FGVC Aircraft, showing significant performance gains over static prompt tuning approaches. To ensure these performance gains translate into trustworthy predictions, we integrate Monte-Carlo Dropout in our approach to improve the reliability of the model predictions and uncertainty estimates. This integration provides valuable insights into the model's predictive confidence, helping to identify when predictions can be trusted and when additional verification is necessary. This dynamic approach offers a robust solution, advancing the state-of-the-art for few-shot fine-grained classification.
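The two mechanisms described above, image-conditioned prompt refinement via cross-attention and Monte-Carlo Dropout for uncertainty, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Random arrays stand in for the frozen CLIP text and patch features, and every name here (`refine_prompts`, `class_probs`, `mc_dropout_predict`, the projection matrices, the temperature, the residual connection, and the placement of dropout on the attention output) is an assumption made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def refine_prompts(text_feats, patch_feats, w_q, w_k, w_v, rng=None, drop_p=0.0):
    """Class text features (queries) attend to image patches (keys/values)."""
    q = text_feats @ w_q                             # (C, d)
    k = patch_feats @ w_k                            # (N, d)
    v = patch_feats @ w_v                            # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (C, N), rows sum to 1
    update = attn @ v                                # (C, d)
    if rng is not None and drop_p > 0.0:
        # Inverted dropout on the attention output (assumed placement).
        mask = rng.random(update.shape) >= drop_p
        update = update * mask / (1.0 - drop_p)
    # Residual connection (assumption): refined prompts stay close to the originals.
    return text_feats + update


def class_probs(image_feat, text_feats, temperature=0.01):
    # CLIP-style scoring: cosine similarity scaled by a temperature (assumed value).
    sims = l2_normalize(text_feats) @ l2_normalize(image_feat)   # (C,)
    return softmax(sims / temperature)


def mc_dropout_predict(image_feat, patch_feats, text_feats, weights,
                       n_samples=20, drop_p=0.1, seed=0):
    """Average class probabilities over stochastic passes with dropout kept on."""
    rng = np.random.default_rng(seed)
    samples = np.stack([
        class_probs(
            image_feat,
            refine_prompts(text_feats, patch_feats, *weights, rng=rng, drop_p=drop_p),
        )
        for _ in range(n_samples)
    ])
    mean = samples.mean(axis=0)
    # Predictive entropy of the averaged distribution as an uncertainty score.
    entropy = -np.sum(mean * np.log(mean + 1e-12))
    return mean, entropy


# Toy stand-ins for frozen encoder outputs: 5 classes, 49 patches, 32 dimensions.
demo_rng = np.random.default_rng(42)
C, N, d = 5, 49, 32
text_feats = demo_rng.normal(size=(C, d))
patch_feats = demo_rng.normal(size=(N, d))
image_feat = patch_feats.mean(axis=0)   # stand-in for the global image feature
weights = tuple(demo_rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

mean_probs, entropy = mc_dropout_predict(image_feat, patch_feats, text_feats, weights)
print(mean_probs.round(3), float(entropy))
```

In this sketch only the three projection matrices would be trainable, mirroring the idea of leaving both encoders frozen. A higher predictive entropy flags an image whose prediction may need additional verification.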
