Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model

SuBeen Lee, GilHan Park, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo

TL;DR

The paper tackles performance gaps in Vision-Language Models under distribution shifts by introducing Auxiliary Descriptive Knowledge (ADK) for Few-Shot Adaptation. ADK generates class-specific descriptions from an LLM offline and derives two knowledge signals, compositional (class-level) and instance-specific (image-conditioned), which are combined with the handcrafted prompt to guide PEFT-based VLM adaptation. This plug-and-play framework improves performance across base-to-novel, all-to-all, and cross-domain settings, achieving state-of-the-art results with minimal online computation. The approach is validated through extensive ablations, overhead analyses, and experiments with varying description counts, demonstrating robustness and practical applicability in diverse few-shot regimes.

Abstract

Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to capture the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Moreover, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.

Paper Structure

This paper contains 26 sections, 14 equations, 8 figures, 18 tables.

Figures (8)

  • Figure 1: Comparison between existing approaches and our ADK. (a) Existing approaches rely on a fixed handcrafted prompt (e.g., "a photo of <class>"), which often provides insufficient information to distinguish categories, such as the 'DH-82' ($I_1$) and the 'DHC-1' ($I_2$). (b) Our ADK introduces additional class- and image-specific descriptive knowledge (e.g., "Biplane Wing" for $I_1$, "Low-wing" for $I_2$). These additional clues enable the model to distinguish categories more effectively. (c) Computational cost comparison with CoCoOp, which leverages image-induced context features. In contrast to CoCoOp, which suffers from significant computational overhead, our ADK introduces negligible additional cost. (d) Harmonic mean performance in the base-to-novel scenario when integrating ADK with various methods. ADK demonstrates superior generalization capabilities while preserving adaptation capabilities.
  • Figure 2: Overall pipeline of Auxiliary Descriptive Knowledge. Given $N$ candidate classes and an image $x_i$, we query a Large Language Model to obtain $M$ (here, $M=2$) descriptions $\text{T}^\text{desc}_{n,m}$ for each class, capturing diverse characteristics of that class. Subsequently, we extract handcrafted features $t^\text{hand}_n$, image features $v_i$, and descriptive features $t^\text{desc}_{n,m}$ from the handcrafted prompt, the image, and the descriptions, respectively. Then, we generate compositional knowledge $t^\text{comp}_n$ by averaging descriptive features to capture general class-level semantics. We also derive instance-specific knowledge $t^\text{inst}_{i,n}$ by selectively weighting descriptive features based on their similarity to the image features. Finally, we utilize these two types of auxiliary knowledge alongside the handcrafted features to optimize the model and predict the image class.
  • Figure 3: Example images and classes for each dataset. From top: FGVC Aircraft, Stanford Cars, and Caltech101.
  • Figure 4: Visualization of attention weights $W_{i,n,m}$ for instance-specific knowledge, where the image and descriptions belong to the same ground-truth class ($y_i=n$). The four descriptions with the highest weights are listed in order, with the remaining descriptions grouped under 'Other'. Best viewed zoomed in.
  • Figure 5: K-shot results under varying K. Results are averaged on all 11 benchmark datasets for FSA-VLM.
  • ...and 3 more figures
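
The pipeline described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the features below are toy vectors standing in for pre-computed encoder outputs, the softmax temperature `tau` is a hypothetical parameter, and both the re-normalization steps and the rule for combining the three text features (a plain average of cosine similarities) are assumptions, since this summary does not specify them. All function names are illustrative.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compositional_knowledge(desc_feats):
    """Average the M description features of one class (t_comp_n)."""
    dim = len(desc_feats[0])
    mean = [sum(f[d] for f in desc_feats) / len(desc_feats) for d in range(dim)]
    return normalize(mean)  # re-normalizing is an assumption

def instance_knowledge(image_feat, desc_feats, tau=0.1):
    """Weight descriptions by similarity to the image (t_inst_{i,n}).

    tau is a hypothetical softmax temperature, not taken from the paper.
    Returns the weighted feature and the attention weights W_{i,n,m}.
    """
    weights = softmax([dot(image_feat, f) / tau for f in desc_feats])
    dim = len(desc_feats[0])
    mixed = [sum(w * f[d] for w, f in zip(weights, desc_feats)) for d in range(dim)]
    return normalize(mixed), weights

def class_scores(image_feat, hand_feats, desc_feats_per_class, tau=0.1):
    """Score each class; averaging the three similarities is an assumption."""
    scores = []
    for t_hand, descs in zip(hand_feats, desc_feats_per_class):
        t_comp = compositional_knowledge(descs)
        t_inst, _ = instance_knowledge(image_feat, descs, tau)
        sims = [dot(image_feat, t) for t in (t_hand, t_comp, t_inst)]
        scores.append(sum(sims) / 3.0)
    return scores

# Toy 3-d features: N=2 classes, M=2 descriptions each (as in Figure 2).
v = normalize([1.0, 0.2, 0.0])
t_hand = [normalize([0.6, 0.6, 0.1]), normalize([0.5, 0.7, 0.1])]
t_desc = [
    [normalize([1.0, 0.0, 0.0]), normalize([0.0, 0.0, 1.0])],
    [normalize([0.0, 1.0, 0.0]), normalize([0.0, 0.0, 1.0])],
]
scores = class_scores(v, t_hand, t_desc)
_, w = instance_knowledge(v, t_desc[0])
```

In this sketch the description features are fixed inputs, mirroring the claim that descriptions are generated and encoded offline; the only per-image work is one dot product per description followed by a softmax, with no learned parameters involved.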