Online Zero-Shot Classification with CLIP

Qi Qian, Juhua Hu

TL;DR

This work addresses a limitation of CLIP-style zero-shot classification: it does not exploit the distribution of the target data. The authors study an online, storage-free setting in which each image is classified once, on arrival, without keeping its representation. They introduce OnZeta, which combines online label learning (OnLab), which models the class distribution of the target data, with online proxy learning, which refines class proxies in the vision space to mitigate the text-vision modality gap, and then fuses the two predictions. Both online components come with convergence guarantees. OnZeta reaches $78.94\%$ accuracy on ImageNet without access to the full dataset, and experiments on 13 other downstream tasks with different vision encoders show an average improvement of more than $3\%$. The approach enables distribution-aware zero-shot transfer for online services while remaining computationally efficient.

Abstract

Vision-language pre-training such as CLIP enables zero-shot transfer that can classify images according to the candidate class names. While CLIP demonstrates impressive zero-shot performance on diverse downstream tasks, the distribution of the target data has not been leveraged sufficiently. In this work, we study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain a prediction immediately without storing its representation. Compared with vanilla zero-shot classification, the proposed framework preserves its flexibility for online service while using the statistics of the arrived images as side information to capture the distribution of the target data, which can help improve the performance of real-world applications. To tackle the challenge of effective online optimization, we first develop online label learning to model the target data distribution. Then, the proxy of each class in the vision space is further optimized with the proposed online proxy learning method to mitigate the modality gap between images and text. The convergence of both online strategies can be theoretically guaranteed. By combining the predicted labels from online label learning and online proxy learning, our online zero-shot transfer method (OnZeta) achieves $78.94\%$ accuracy on ImageNet without accessing the entire data set. Moreover, extensive experiments on 13 other downstream tasks with different vision encoders show a more than $3\%$ improvement on average, which demonstrates the effectiveness of our proposal. Code is available at https://github.com/idstcv/OnZeta.
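
The online setting described in the abstract can be illustrated with a minimal sketch. This is not the authors' OnZeta update rule (the linked repository is the reference for that). It only shows the constraints of the scenario: each feature is seen once, a prediction is returned immediately, and only per-class statistics are kept. The class name `OnlineZeroShotSketch`, the running-average proxy update, the fixed fusion weight `mix`, and the temperature `tau` are all assumptions made for illustration, and synthetic vectors stand in for CLIP embeddings.

```python
import numpy as np


def l2_normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()


class OnlineZeroShotSketch:
    """Illustrative streaming classifier (hypothetical, not OnZeta itself)."""

    def __init__(self, text_proxies, tau=0.01, mix=0.5):
        self.text = l2_normalize(text_proxies)  # (C, d) class-name embeddings
        self.tau = tau                          # assumed softmax temperature
        self.mix = mix                          # assumed fixed fusion weight
        # State is O(C * d): no image representation is ever stored.
        self.class_mass = np.ones(self.text.shape[0])  # running soft counts
        self.vision_sum = self.text.copy()      # vision proxies start at text proxies

    def class_distribution(self):
        """Running estimate of the class distribution of the stream."""
        return self.class_mass / self.class_mass.sum()

    def step(self, x):
        """Classify one arriving feature, then update the running statistics."""
        x = l2_normalize(x)
        # Text-space prediction: vanilla zero-shot softmax over cosine similarity.
        p_text = softmax(self.text @ x / self.tau)
        # Vision-space prediction from the current class proxies.
        p_vis = softmax(l2_normalize(self.vision_sum) @ x / self.tau)
        # Fuse the two predictions with a convex combination.
        p = (1.0 - self.mix) * p_text + self.mix * p_vis
        # Update statistics with soft labels; x is discarded afterwards.
        self.class_mass += p_text
        self.vision_sum += p_text[:, None] * x
        return int(np.argmax(p)), p


# Synthetic stream: features are noisy copies of their class's text proxy.
rng = np.random.default_rng(0)
C, d, n = 5, 32, 200
text = rng.normal(size=(C, d))
labels = rng.integers(0, C, size=n)
feats = l2_normalize(text)[labels] + 0.05 * rng.normal(size=(n, d))

clf = OnlineZeroShotSketch(text)
preds = [clf.step(x)[0] for x in feats]  # single pass, in arrival order
accuracy = float(np.mean(np.array(preds) == labels))
print(f"streaming accuracy on synthetic data: {accuracy:.3f}")
```

With real data, `text` would hold the text-encoder embeddings of the class names and each `x` would be the image-encoder embedding of the arriving image. The memory footprint stays fixed at one vector and one scalar per class regardless of how many images have been seen.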

Paper Structure (28 sections, 3 theorems, 24 equations, 3 figures, 10 tables, 2 algorithms)

Key Result

Proposition 1

The optimal solution to the problem in Eqn. eq:label is

Figures (3)

  • Figure 1: Illustration of the proposed online zero-shot transfer method (OnZeta). Blue and orange lines denote the inference in the text and vision space, respectively. By incorporating the predictions from online label learning and online proxy learning, OnZeta can leverage the biased prediction from the text space to reduce the variance in the target vision space in an online manner.
  • Figure 2: Top 4 predicted labels with corresponding probabilities from baseline zero-shot method by CLIP and our proposal.
  • Figure 3: Illustration of data distribution over $1{,}000$ classes with different $\alpha$ on ImageNet. Best viewed in color.

Theorems & Definitions (6)

  • Proposition 1
  • proof
  • Theorem 1
  • Theorem 2
  • proof
  • proof