BATR-FST: Bi-Level Adaptive Token Refinement for Few-Shot Transformers

Mohammed Al-Habib, Zuping Zhang, Abdulrahman Noman

TL;DR

This work tackles few-shot classification with Vision Transformers by introducing BATR-FST, a two-stage framework that first uses Masked Image Modeling for robust patch-level pre-training and then applies Bi-Level Adaptive Token Refinement to dynamically refine token representations. The method combines Graph Construction and Token Clustering, Uncertainty-Aware Token Weighting, Bi-Level Attention, Graph Token Propagation, and a Class Separation Penalty to balance local and global context while preserving discriminative capacity. Empirical results on mini-ImageNet, tiered-ImageNet, and CIFAR-FS show strong 1-shot and 5-shot performance with a ViT-S backbone (22M parameters), indicating improved generalization in data-scarce regimes. The approach offers a principled way to improve transformer-based few-shot learning by refining token-level interactions and enforcing semantic coherence between support and query tokens.

Abstract

Vision Transformers (ViTs) have shown significant promise in computer vision applications. However, their performance in few-shot learning is limited by three challenges: refining token-level interactions, coping with limited training data, and developing a strong inductive bias. Existing methods often depend on inflexible token matching or basic similarity measures, which limit the effective incorporation of global context and localized feature refinement. To address these challenges, we propose Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST), a two-stage approach that progressively improves token representations and maintains a robust inductive bias for few-shot classification. During the pre-training phase, Masked Image Modeling (MIM) gives the ViT transferable patch-level representations by reconstructing masked image regions, providing a robust basis for subsequent adaptation. In the meta-fine-tuning phase, BATR-FST incorporates a Bi-Level Adaptive Token Refinement module that uses Token Clustering to capture localized interactions, Uncertainty-Aware Token Weighting to prioritize reliable features, and a Bi-Level Attention mechanism to balance intra-cluster and inter-cluster relationships, thereby enabling thorough token refinement. Furthermore, Graph Token Propagation enforces semantic consistency between support and query instances, while a Class Separation Penalty keeps class boundaries distinct, enhancing discriminative capability. Extensive experiments on three few-shot benchmark datasets demonstrate that BATR-FST achieves superior results in both 1-shot and 5-shot scenarios and improves transformer-based few-shot classification.
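
The pre-training idea described in the abstract, reconstructing masked image regions, can be illustrated with a minimal sketch. This is not the authors' implementation: the patch size, the mask ratio of 0.5, the mean-squared-error objective, and the helper names (`patchify`, `random_mask`, `masked_mse`) are all assumptions chosen for illustration, and the all-zeros "prediction" stands in for a real ViT encoder and decoder.

```python
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the model (assumed uniform sampling)."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_mse(pred, target, masked_idx):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)

# Toy 8x8 single-channel "image" with distinct pixel values.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]

patches = patchify(image, patch=4)  # 4 patches of 16 pixels each
masked_idx = random_mask(len(patches), mask_ratio=0.5, rng=rng)

# Placeholder model output: predicts zeros for every patch.
pred = [[0.0] * len(p) for p in patches]
loss = masked_mse(pred, patches, masked_idx)
```

Restricting the loss to masked patches means the model is scored only on regions it could not see, which is what pushes it toward learning patch-level context rather than copying its input.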

Paper Structure

This paper contains 22 sections, 12 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The framework of our Bi-Level Adaptive Token Refinement method utilizes a ViT-Small architecture as the feature extractor.
  • Figure 2: Grad-CAM visualization of our method on mini-ImageNet. Each column in both groups belongs to the same class.
  • Figure 3: Analysis of training dynamics: (a) Validation accuracy across epochs shows the learning behavior, and (b) inner-loop iterations highlight their influence on few-shot performance.
  • Figure 4: Effect of $k_{\text{local}}$ (a) and $k_{\text{global}}$ (b) on test accuracy for the mini-ImageNet 5-way 5-shot task. (a) Number of local tokens retained for fine-grained interactions. (b) Number of global tokens for cross-cluster attention to maintain global semantics.