Bayesian Exploration of Pre-trained Models for Low-shot Image Classification

Yibo Miao, Yu Lei, Feng Zhou, Zhijie Deng

TL;DR

Low-shot image classification with CLIP often underutilizes other pre-trained models. The authors propose a Gaussian process ensemble that uses the CLIP zero-shot classifier as the mean and a sum of deep kernels derived from multiple pre-trained models as the kernel, enabling analytical inference and uncertainty quantification. Across benchmarks, the approach yields competitive or superior predictive performance, robust OOD generalization, and improved calibration compared to deterministic baselines. This work demonstrates the practical value of Bayesian model combination for leveraging diverse priors in the large-model era and highlights opportunities for uncertainty-aware deployment and OOD detection.

Abstract

Low-shot image classification is a fundamental task in computer vision, and the emergence of large-scale vision-language models such as CLIP has greatly advanced the forefront of research in this field. However, most existing CLIP-based methods lack the flexibility to effectively incorporate other pre-trained models that encompass knowledge distinct from CLIP. To bridge the gap, this work proposes a simple and effective probabilistic model ensemble framework based on Gaussian processes, which have previously demonstrated remarkable efficacy in processing small data. We achieve the integration of prior knowledge by specifying the mean function with CLIP and the kernel function with an ensemble of deep kernels built upon various pre-trained models. By regressing the classification label directly, our framework enables analytical inference, straightforward uncertainty quantification, and principled hyper-parameter tuning. Through extensive experiments on standard benchmarks, we demonstrate that our method consistently outperforms competitive ensemble baselines regarding predictive performance. Additionally, we assess the robustness of our method and the quality of the yielded uncertainty estimates on out-of-distribution datasets. We also illustrate that our method, despite relying on label regression, still enjoys superior model calibration compared to most deterministic baselines.
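The construction described in the abstract (a GP regressor on one-hot labels, with a CLIP-based mean and a sum of kernels over several pre-trained feature extractors) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The RBF base kernel, the equal kernel weights, the fixed noise level, and the random arrays standing in for pre-trained features and CLIP zero-shot probabilities are all assumptions made for illustration, and every function name here is hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel between the rows of a and b (assumed base kernel).
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sum_kernel(feats_a, feats_b, weights):
    # Weighted sum of one kernel per feature extractor.
    return sum(w * rbf_kernel(fa, fb) for w, fa, fb in zip(weights, feats_a, feats_b))

def gp_predict(train_feats, test_feats, y_onehot, mean_train, mean_test,
               weights, noise=1e-2):
    # Closed-form GP regression posterior with a non-zero prior mean.
    K = sum_kernel(train_feats, train_feats, weights) + noise * np.eye(len(y_onehot))
    K_s = sum_kernel(test_feats, train_feats, weights)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_onehot - mean_train))
    post_mean = mean_test + K_s @ alpha          # one regression output per class
    v = np.linalg.solve(L, K_s.T)
    prior_var = sum(weights)                     # RBF gives k(x, x) = 1 per model
    post_var = prior_var - (v**2).sum(0)         # variance shared across classes
    return post_mean, post_var

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy data: random stand-ins for two feature extractors and for CLIP probabilities.
rng = np.random.default_rng(0)
n_train, n_test, n_classes = 8, 5, 4
train_feats = [l2_normalize(rng.normal(size=(n_train, d))) for d in (16, 32)]
test_feats = [l2_normalize(rng.normal(size=(n_test, d))) for d in (16, 32)]
labels = rng.integers(0, n_classes, size=n_train)
y_onehot = np.eye(n_classes)[labels]
mean_train = softmax(rng.normal(size=(n_train, n_classes)))
mean_test = softmax(rng.normal(size=(n_test, n_classes)))
weights = [1.0, 1.0]

post_mean, post_var = gp_predict(train_feats, test_feats, y_onehot,
                                 mean_train, mean_test, weights)
pred = post_mean.argmax(axis=1)   # predicted class from the posterior mean
```

In this sketch the predicted label comes from the posterior mean and the uncertainty estimate from the posterior variance, matching the roles the paper assigns to them. Because regression on labels keeps the likelihood Gaussian, the posterior has a closed form and no approximate inference is needed.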

Paper Structure (21 sections, 8 equations, 8 figures, 6 tables, 1 algorithm)

Figures (8)

  • Figure 1: Overview of our method. We leverage a GP regressor to tackle the low-shot image classification problem. To integrate knowledge from CLIP and other pre-trained models, we use them to specify the GP mean and kernel. The label is determined by the mean, and the uncertainty estimate is determined by the variance.
  • Figure 2: Comparison of low-shot classification accuracy (%) on the ten popular benchmarks.
  • Figure 3: Histogram of uncertainty estimates. We evaluate different ensemble methods on ImageNet, ImageNet-V2, and ImageNet-Sketch.
  • Figure 4: Reliability diagrams of the four ensemble methods.
  • Figure 5: Ablation studies on (a) GP mean, (b) GP base kernel, (c) pre-trained model, and (d) hyper-parameter optimization objective. All experiments are conducted on ImageNet.
  • ...and 3 more figures