Training Spatial-Frequency Visual Prompts and Probabilistic Clusters for Accurate Black-Box Transfer Learning
Wonwoo Cho, Kangyeol Kim, Saemee Choi, Jaegul Choo
TL;DR
The paper tackles few-shot transfer learning for black-box pre-trained vision models under data scarcity by coupling a spatial-frequency visual prompter with probabilistic cluster-based output refinement. It introduces an encoder-free visual prompter that generates both spatial and low-frequency prompts through a single decoder, using frequency-domain manipulation via $\text{dct}$/$\text{idct}$, alongside KL-means based refinement of the output probabilities using auxiliary simplex prototypes. The training objective combines classification, auxiliary, and intra-class relation losses, and is optimized with SPSA-GC-style zeroth-order optimization plus gradient surgery to stabilize updates in the black-box setting. Empirically, the method achieves superior few-shot transfer performance across 14 datasets and synthetic distribution-shift tasks, while reducing training and inference costs compared to prior black-box prompting approaches. The approach is therefore suited to adapting large PTMs when API-only access rules out gradient-based fine-tuning, including in resource-limited deployments.
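To make the zeroth-order idea concrete, the following is a minimal sketch of plain two-sided SPSA, which estimates a gradient from only two loss queries per step. It is an illustration under stated assumptions, not the paper's training procedure: it omits the gradient-correction part of SPSA-GC and the gradient surgery step, and the names `spsa_gradient`, `spsa_minimize`, and `quadratic_loss` are hypothetical stand-ins (the toy loss plays the role of a black-box model's loss).

```python
import random


def spsa_gradient(loss_fn, params, c=0.01, rng=None):
    """Two-sided SPSA estimate of the gradient of loss_fn at params."""
    rng = rng or random.Random(0)
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + c * d for p, d in zip(params, delta)]
    minus = [p - c * d for p, d in zip(params, delta)]
    # Only two black-box queries are needed, regardless of the dimension.
    diff = loss_fn(plus) - loss_fn(minus)
    return [diff / (2.0 * c * d) for d in delta]


def spsa_minimize(loss_fn, params, steps=200, lr=0.1, c=0.01, seed=0):
    """Plain gradient descent driven by SPSA estimates (no gradient correction)."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        grad = spsa_gradient(loss_fn, params, c=c, rng=rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


def quadratic_loss(theta):
    """Toy stand-in for a black-box loss; its minimum is at theta = (1, ..., 1)."""
    return sum((t - 1.0) ** 2 for t in theta)


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]
    result = spsa_minimize(quadratic_loss, start)
    print("initial loss:", quadratic_loss(start))
    print("final loss:", quadratic_loss(result))
```

The point of the sketch is that the cost per update is two forward queries no matter how many prompt parameters are tuned, which is why this family of estimators fits API-only access.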
Abstract
Despite the growing prevalence of black-box pre-trained models (PTMs) such as prediction API services, directly applying general models to real-world scenarios remains a significant challenge due to the data distribution gap. Targeting scenarios with scarce data and constrained computational resources, this paper proposes a novel parameter-efficient transfer learning framework for vision recognition models in the black-box setting. Our framework incorporates two training techniques. First, we align the input space (i.e., image) of PTMs to the target data distribution by generating visual prompts in the spatial and frequency domains. Second, alongside this spatial-frequency hybrid visual prompter, we design a training technique based on probabilistic clusters, which enhances class separation in the output space (i.e., prediction probabilities). In experiments, our model demonstrates superior performance in a few-shot transfer learning setting across extensive visual recognition datasets, surpassing state-of-the-art baselines. Additionally, we show that the proposed method reduces computational costs in both the training and inference phases.
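As a rough illustration of what a frequency-domain prompt can look like, the sketch below adds a small set of learnable values to the lowest DCT coefficients of a 1D signal and transforms back. It is a simplified assumption for exposition only: the paper works on images and produces its prompts with a decoder, whereas here the function `add_low_frequency_prompt` and the hand-written orthonormal `dct`/`idct` helpers are hypothetical and operate on a plain list of numbers.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def add_low_frequency_prompt(signal, prompt_coeffs):
    """Add prompt values to the lowest-frequency DCT coefficients of signal.

    prompt_coeffs is a hypothetical learnable vector, shorter than the signal,
    so only the low-frequency band is modified.
    """
    coeffs = dct(signal)
    for k, p in enumerate(prompt_coeffs):
        coeffs[k] += p
    return idct(coeffs)


if __name__ == "__main__":
    signal = [0.2, 0.4, 0.1, 0.9, 0.5, 0.3, 0.7, 0.6]
    prompted = add_low_frequency_prompt(signal, [0.5, -0.1])
    print([round(v, 4) for v in prompted])
```

Restricting the prompt to a few low-frequency coefficients keeps the number of tunable values small, which matters when every update must be estimated from forward queries alone.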
