Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed

Abstract

Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global unimodal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to transfer this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter, with zero extra latency or memory. Across standard 1- to 16-shot benchmarks, our method consistently establishes a new state of the art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.
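
To make the "inference remains identical to Tip-Adapter" claim concrete, the following is a minimal NumPy sketch of Tip-Adapter-style cache inference as it is commonly described: zero-shot logits from text weights plus a key-value cache term. It is not taken from the TOGA repository. The function names, the toy data, and the default values of `alpha`, `beta`, and `scale` are assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def tip_adapter_logits(query, cache_keys, cache_values, text_weights,
                       alpha=1.0, beta=5.5, scale=100.0):
    """Tip-Adapter-style logits (illustrative sketch, hypothetical defaults).

    query:        (B, D)  global image features
    cache_keys:   (NK, D) cached few-shot support features (the adapter keys)
    cache_values: (NK, C) one-hot labels of the support features
    text_weights: (C, D)  class text embeddings
    """
    q = l2_normalize(query)
    keys = l2_normalize(cache_keys)
    w = l2_normalize(text_weights)
    affinity = q @ keys.T                                  # cosine similarity, (B, NK)
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values  # (B, C)
    zero_shot_logits = scale * (q @ w.T)                   # (B, C)
    return zero_shot_logits + alpha * cache_logits

# Toy data: 3 classes, 2 shots each, 8-dimensional features.
keys = np.eye(6, 8)                         # six orthonormal support features
values = np.repeat(np.eye(3), 2, axis=0)    # labels 0,0,1,1,2,2 as one-hot rows
text = np.eye(3, 8)                         # toy class text embeddings
query = keys[2:3]                           # identical to a class-1 support feature

logits = tip_adapter_logits(query, keys, values, text)
cache_only = logits - tip_adapter_logits(query, keys, values, text, alpha=0.0)
```

In this sketch only `cache_keys` would be trainable. Any training-time supervision that changes those keys leaves the function above, and therefore test-time cost, unchanged.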

Paper Structure (22 sections, 4 equations, 14 figures, 14 tables)

This paper contains 22 sections, 4 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: The core trade-off of VLM adaptation and our proposed solution. (a) Lightweight adapters (e.g., Tip-Adapter [tipadapter]) are fast but limited, as they reason only over the single, global image feature. (b) Heavyweight adapters (e.g., GraphAdapter [li2023graphadapter]) are more powerful, reasoning over visual patches, but their complex GNN adds a permanent, significant cost at inference time. (c) Our proposed method, TOGA (the best of both), uses a powerful auxiliary teacher. This teacher, our modality-aware graph transformer (MGT), performs deep bi-modal reasoning on a unified graph of patches and text. It transfers its knowledge to the lightweight "student" adapter, which is the only component used at inference, achieving high performance with the efficiency of (a).
  • Figure 1: Comparison of few-shot classification accuracy (%) on 11 benchmark datasets. We evaluate our method against several SOTA adapter-based approaches. The best performance in each shot-group is marked in bold. Our results are highlighted in blue. Dataset abbreviations: INet (ImageNet), SUN (SUN397), Air (FGVC-Aircraft), Euro (EuroSAT), Cars (Stanford Cars), Food (Food101), Pets (OxfordPets), Flow (Flowers102), Cal (Caltech101), DTD (Describable Textures), UCF (UCF101).
  • Figure 2: Our Training-Only Graph Adapter (TOGA) is an asymmetric supervision pipeline. At train time (red dotted region and $\rightarrow$ denote training only), we use a three-branch ensemble: (1) a frozen $L_\text{ZS}$ branch, (2) a lightweight student (Cache Model) $L_\text{Cache}$, and (3) our powerful, auxiliary graph teacher $L_\text{Graph}$. The teacher enriches multi-scale patches and text embeddings via unimodal Transformers, then constructs a modality-aware graph transformer to perform cross-modal reasoning. A dual-loss objective regularizes the teacher's knowledge into the student's adapter $\mathcal{A}$. At test time, the entire teacher branch is discarded, resulting in zero additional inference cost.
  • Figure 3: Comparison of few-shot accuracy (%) on three benchmarks and the 11-dataset average. More results in the supplementary.
  • Figure 4: Qualitative visualization of Discriminative Node Filtering. Our Top-$\mathbb{N}$ filtering learns to retain high-scoring, discriminative foreground patches (Green) while suppressing non-informative background nodes (Blue). This resolves feature dilution inherent in global pooling, yielding a cleaner signal by focusing on key object parts (e.g., ant's head, flower pistil, cat's eyes). Samples from Caltech-101; more visualizations in supplementary.
  • ...and 9 more figures
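
The captions for Figures 2 and 4 describe two mechanisms that a short sketch can illustrate: Top-N node filtering, which keeps only the highest-scoring patch nodes before pooling, and a graph branch that exists only at training time. The NumPy code below is a hedged illustration, not the authors' implementation. How node scores are produced, the use of mean pooling, and the plain summation of branch logits are all assumptions. `top_n_pool` and `ensemble_logits` are hypothetical helper names.

```python
import numpy as np

def top_n_pool(node_feats, node_scores, n):
    """Keep the n highest-scoring nodes and mean-pool their features.

    node_feats:  (P, D) patch-node features
    node_scores: (P,)   per-node discriminativeness scores (assumed given)
    Returns the pooled (D,) feature and the sorted indices of kept nodes.
    """
    n = min(n, node_feats.shape[0])
    keep = np.argsort(node_scores)[::-1][:n]   # indices of the top-n scores
    return node_feats[keep].mean(axis=0), np.sort(keep)

def ensemble_logits(l_zs, l_cache, l_graph=None):
    """Combine branch logits; summation is an assumption of this sketch.

    At training time all three branches contribute. At test time the graph
    branch is dropped (l_graph=None), leaving the zero-shot and cache terms.
    """
    out = l_zs + l_cache
    if l_graph is not None:
        out = out + l_graph
    return out

# Toy example: four patch nodes, keep the two with the highest scores.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [4.0, 4.0], [2.0, 2.0]])
scores = np.array([0.1, 0.9, 0.8, 0.2])
pooled, kept = top_n_pool(feats, scores, n=2)

l_zs = np.array([2.0, 1.0, 0.0])
l_cache = np.array([0.0, 0.5, 0.0])
l_graph = np.array([0.0, 1.0, 0.0])
train_logits = ensemble_logits(l_zs, l_cache, l_graph)  # three branches
test_logits = ensemble_logits(l_zs, l_cache)            # teacher discarded
```

Filtering before pooling is what keeps low-scoring background nodes from diluting the class feature. The optional `l_graph` argument mirrors the asymmetry in Figure 2, where the teacher branch is present only during training.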