Training Spatial-Frequency Visual Prompts and Probabilistic Clusters for Accurate Black-Box Transfer Learning

Wonwoo Cho, Kangyeol Kim, Saemee Choi, Jaegul Choo

TL;DR

The paper tackles few-shot transfer learning for black-box pre-trained vision models under data scarcity by coupling a spatial-frequency visual prompter with probabilistic cluster-based output refinement. It introduces an encoder-free visual prompter in which a single decoder generates both a spatial prompt and a low-frequency prompt, the latter applied in the frequency domain via the discrete cosine transform and its inverse ($\text{dct}$/$\text{idct}$). On the output side, it refines the prediction probabilities with KL-means-based clustering, aided during training by auxiliary simplex prototypes. The training objective combines classification, auxiliary, and intra-class relation losses; because gradients of the black-box model are unavailable, the prompter is updated with SPSA-GC-style zeroth-order optimization, and gradient surgery is used to stabilize the updates. Empirically, the method outperforms prior black-box prompting approaches in few-shot transfer across 14 datasets and synthetic distribution-shift tasks, while reducing training and inference costs. Overall, the approach improves the practical adaptability and efficiency of large PTMs when API-only access rules out gradient-based fine-tuning, which supports deployment in resource-limited real-world scenarios.
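To make the zeroth-order idea concrete, the following is a minimal sketch of a plain two-sided SPSA gradient estimate, which needs only loss values and no backpropagation through the model. It is not the paper's SPSA-GC procedure: the gradient-correction step and the gradient surgery are omitted, and the names `spsa_gradient`, `quadratic_loss`, the perturbation size `c`, and the learning rate are illustrative assumptions. A toy quadratic stands in for the black-box loss.

```python
import numpy as np


def spsa_gradient(loss_fn, params, c=0.01, rng=None):
    """Two-sided SPSA estimate of the gradient of loss_fn at params.

    Only two loss evaluations are needed, regardless of the number of
    parameters, which is why SPSA suits query-only (black-box) models.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Random Rademacher perturbation: each entry is +1 or -1.
    delta = rng.choice([-1.0, 1.0], size=params.shape)
    loss_plus = loss_fn(params + c * delta)
    loss_minus = loss_fn(params - c * delta)
    # Entries of delta are +/-1, so dividing by delta equals multiplying by it.
    return (loss_plus - loss_minus) / (2.0 * c) * delta


def quadratic_loss(p):
    # Hypothetical stand-in for a loss computed from black-box API outputs.
    return float(np.sum((p - 3.0) ** 2))


rng = np.random.default_rng(0)
params = np.zeros(4)
for _ in range(500):
    grad_est = spsa_gradient(quadratic_loss, params, c=0.01, rng=rng)
    params = params - 0.05 * grad_est  # plain gradient-descent step
```

In this toy setting the parameters approach the minimizer at 3.0 even though the true gradient is never computed; with a real prediction API each step would cost two queries per batch.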

Abstract

Despite the growing prevalence of black-box pre-trained models (PTMs) such as prediction API services, there remains a significant challenge in directly applying general models to real-world scenarios due to the data distribution gap. Considering a data deficiency and constrained computational resource scenario, this paper proposes a novel parameter-efficient transfer learning framework for vision recognition models in the black-box setting. Our framework incorporates two novel training techniques. First, we align the input space (i.e., image) of PTMs to the target data distribution by generating visual prompts of spatial and frequency domain. Along with the novel spatial-frequency hybrid visual prompter, we design a novel training technique based on probabilistic clusters, which can enhance class separation in the output space (i.e., prediction probabilities). In experiments, our model demonstrates superior performance in a few-shot transfer learning setting across extensive visual recognition datasets, surpassing state-of-the-art baselines. Additionally, we show that the proposed method efficiently reduces computational costs for training and inference phases.
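As a rough illustration of cluster-based refinement in the output space, the sketch below assigns each prediction to the class whose centroid on the probability simplex is closest in KL divergence. This is an assumption-laden simplification and not the paper's exact procedure: the centroids here are simple per-class means of few-shot training predictions, the direction of the KL divergence is chosen arbitrarily, and the auxiliary simplex prototypes are not modeled. All function names and numbers are hypothetical.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) along the last axis, with clipping for numerical safety."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


def class_centroids(probs, labels, num_classes):
    """Mean predicted distribution per class (assumes every class has samples)."""
    return np.stack([probs[labels == k].mean(axis=0) for k in range(num_classes)])


def refine_predictions(probs, centroids):
    """Assign each row of probs to the centroid with the smallest KL divergence."""
    # Broadcast to shape (N, K, C) and reduce over the probability axis.
    distances = kl_divergence(probs[:, None, :], centroids[None, :, :])
    return distances.argmin(axis=1)


# Toy few-shot predictions: class 1 samples still put most mass on output 0,
# so a plain argmax would mislabel them.
train_probs = np.array([
    [0.60, 0.30, 0.10], [0.62, 0.28, 0.10],   # class 0
    [0.50, 0.10, 0.40], [0.48, 0.12, 0.40],   # class 1
    [0.10, 0.20, 0.70], [0.12, 0.18, 0.70],   # class 2
])
train_labels = np.array([0, 0, 1, 1, 2, 2])
centroids = class_centroids(train_probs, train_labels, num_classes=3)

query_probs = np.array([[0.52, 0.12, 0.36], [0.15, 0.20, 0.65]])
argmax_labels = query_probs.argmax(axis=1)
refined_labels = refine_predictions(query_probs, centroids)
```

The first query is a case where the argmax picks output 0 while the nearest centroid belongs to class 1, which is the kind of class separation that clustering on prediction probabilities is meant to recover.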

Paper Structure

This paper contains 13 sections, 12 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: An overall training workflow of our proposed method. (a) Our visual prompter consists of a single decoder $D_{\phi^d}$ and two trigger vectors ($\phi^{v_1}, \phi^{v_2}$); the decoder simultaneously generates two visual prompts (VPs), one in the spatial domain and one in the frequency domain. The learnable scaling parameter $\phi^s$ controls the effect of the low-frequency VP according to its efficacy. (b) After obtaining the prediction probabilities $P_{\theta}(V_{\phi}(\mathbf{X}))$, we refine the predictions via KL-based cluster analysis. During training, we use auxiliary simplex prototypes to enhance the effectiveness of the clustering-based prediction refinement.
  • Figure 2: An illustration of various visual prompting methods. (a) Visual prompting in the spatial domain is achieved by padding a VP outside each image. The VP itself can be trained using the principles of BAR [reprogramming20]. (b) The spatial-domain visual prompting of BlackVIP [blackvip23], where VPs are generated by an encoder-decoder network. (c) Low-frequency visual prompting, where a decoder produces low-frequency visual prompts located in the top-left corner of the DCT coefficients (the low-frequency region). (d) Our spatial-frequency visual prompting method, where spatial- and frequency-domain VPs are simultaneously constructed by a single decoder.
  • Figure 3: From left to right, we present images sampled from the CLEVR dataset and the corresponding GradCAM analysis results of BAR, VP (Black), BlackVIP, and our method.
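The low-frequency prompting in Figure 2(c) can be sketched as follows: transform the image with a 2D DCT, add a small prompt to the top-left block of coefficients, and transform back. This is a minimal single-channel illustration under stated assumptions, not the paper's implementation: the prompt is a fixed array rather than a decoder output, `scale` merely stands in for the learnable parameter $\phi^s$, and the helper names are hypothetical. The DCT is built as an orthonormal DCT-II matrix in NumPy so the block has no other dependencies.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix D of size n x n, so that D @ D.T == I."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] /= np.sqrt(2.0)   # rescale the DC row for orthonormality
    return mat


def add_low_frequency_prompt(image, prompt, scale=1.0):
    """Add `prompt` to the top-left (low-frequency) DCT block of `image`.

    image:  (H, W) float array, a single channel
    prompt: (h, w) float array with h <= H and w <= W
    """
    height, width = image.shape
    d_rows, d_cols = dct_matrix(height), dct_matrix(width)
    coeffs = d_rows @ image @ d_cols.T          # forward 2D DCT
    h, w = prompt.shape
    coeffs[:h, :w] += scale * prompt            # perturb low frequencies only
    return d_rows.T @ coeffs @ d_cols           # inverse 2D DCT


rng = np.random.default_rng(0)
image = rng.random((8, 8))
prompt = np.ones((2, 2))
prompted = add_low_frequency_prompt(image, prompt, scale=0.5)
```

Because the transform is orthonormal, the difference between `prompted` and `image` has nonzero DCT coefficients only in the 2 x 2 low-frequency block, so the prompt appears in pixel space as a smooth, image-wide pattern rather than a localized patch.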