Label Propagation for Zero-shot Classification with Vision-Language Models

Vladan Stojnić, Yannis Kalantidis, Giorgos Tolias

TL;DR

This work tackles zero-shot classification with unlabeled data by introducing ZLaP, a non-parametric label-propagation method tailored to bi-modal vision-language graphs. By constructing a graph that links text-based class representations with image features and exploiting geodesic similarities from the inverted graph Laplacian, ZLaP achieves strong transductive and inductive performance without model fine-tuning. The approach includes efficient dual formulations and sparsified offline components to enable scalable test-time inference, and it benefits further when combined with InMaP proxies and LLM-generated prompts. Empirically, ZLaP delivers state-of-the-art results on 14 diverse datasets across multiple VLM backbones, demonstrating robust improvements and practical applicability, even for black-box VLMs.

Abstract

Vision-Language Models (VLMs) have demonstrated impressive performance on zero-shot classification, i.e. classification when provided merely with a list of class names. In this paper, we tackle the case of zero-shot classification in the presence of unlabeled data. We leverage the graph structure of the unlabeled data and introduce ZLaP, a method based on label propagation (LP) that utilizes geodesic distances for classification. We tailor LP to graphs containing both text and image features and further propose an efficient method for performing inductive inference based on a dual solution and a sparsification step. We perform extensive experiments to evaluate the effectiveness of our method on 14 common datasets and show that ZLaP outperforms the latest related works. Code: https://github.com/vladan-stojnic/ZLaP
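
As a rough illustration of the idea, the sketch below runs generic closed-form label propagation on a small graph whose nodes are both class (text) representations and image features. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `knn_affinity` and `label_propagation`, the parameters `k` and `alpha`, and the plain symmetric kNN graph are illustrative choices, and the bi-modal graph construction, dual solution, and sparsification used by ZLaP are not reproduced here.

```python
import numpy as np


def knn_affinity(sim, k):
    """Keep the k largest similarities per row, clip negatives, and symmetrize.

    Hypothetical helper: a plain kNN graph, assumed here for illustration.
    """
    sim = np.array(sim, dtype=float)
    n = sim.shape[0]
    np.fill_diagonal(sim, -np.inf)             # exclude self-loops
    idx = np.argsort(-sim, axis=1)[:, :k]      # k nearest neighbours per node
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.maximum(sim[rows, cols], 0.0)
    return np.maximum(W, W.T)                  # undirected graph


def label_propagation(text_feats, image_feats, k=5, alpha=0.9):
    """Predict a class index for every image (illustrative sketch).

    text_feats:  (C, d) array, one L2-normalized representation per class
    image_feats: (N, d) array of L2-normalized image features
    """
    C = text_feats.shape[0]
    X = np.vstack([text_feats, image_feats])   # class nodes first, then images
    W = knn_affinity(X @ X.T, k)

    # Symmetrically normalized adjacency S = D^{-1/2} W D^{-1/2}
    deg = W.sum(axis=1)
    deg[deg == 0] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Only the class nodes carry labels; images are unlabeled.
    Y = np.zeros((X.shape[0], C))
    Y[np.arange(C), np.arange(C)] = 1.0

    # Closed-form propagation: solve (I - alpha * S) Z = Y
    Z = np.linalg.solve(np.eye(X.shape[0]) - alpha * S, Y)
    return Z[C:].argmax(axis=1)
```

With `alpha` below 1 the matrix `I - alpha * S` is positive definite, so the linear system always has a unique solution. Each image is assigned the class whose text node reaches it most strongly through the graph.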

Paper Structure

This paper contains 41 sections, 11 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Zero-shot classification performance of the proposed ZLaP classifier applied over CLIP [rkh+21], as well as over the (concurrent) InMaP [qxh23] approach. Our method offers performance gains for both transductive (left) and inductive (right) inference. Average accuracy over 14 common datasets is reported.
  • Figure 2: t-SNE visualization for the original CLIP features (left) and our geodesic similarity (right). The former is computed with the features as input, while the latter is computed with $L_{\text{inv}}$ used as a pairwise similarity matrix. $\star$: class representation, $\bullet$: image representation. Figure generated for five random classes from the CUB dataset.
  • Figure 3: Similarity distributions among features of the same or different modality, using 7 textual templates [uga23] (left) or the InMaP proxies (right) as class representations.
  • Figure 4: Sparsifying matrix $\hat{Y}$ for inductive CLIP+ZLaP: effect of maintaining only the top elements per row/column/matrix.
  • Figure 5: Zero-shot classification accuracy averaged over 14 datasets for the transductive (top) and inductive (bottom) setups. Results per dataset are reported in the supplementary material.
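
The three sparsification variants named in the Figure 4 caption (keeping only the top elements per row, per column, or over the whole matrix) can be sketched as follows. This is a hedged illustration only: the function name `keep_top` and the parameter `m` are hypothetical, and the listing above does not say which variant or which value the paper adopts.

```python
import numpy as np


def keep_top(M, m, mode="row"):
    """Zero all but the m largest entries per row, per column, or in the matrix.

    Hypothetical helper for illustration; ties are broken arbitrarily.
    """
    M = np.asarray(M, dtype=float)
    out = np.zeros(M.shape)
    if mode == "row":
        # Indices of the m largest entries in each row
        idx = np.argpartition(-M, m - 1, axis=1)[:, :m]
        rows = np.arange(M.shape[0])[:, None]
        out[rows, idx] = M[rows, idx]
    elif mode == "column":
        # Per-column selection is per-row selection on the transpose
        out = keep_top(M.T, m, mode="row").T
    elif mode == "matrix":
        # Indices of the m largest entries over the flattened matrix
        flat = np.argpartition(-M.ravel(), m - 1)[:m]
        r, c = np.unravel_index(flat, M.shape)
        out[r, c] = M[r, c]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out
```

For example, `keep_top(M, 1, "row")` leaves a single non-zero entry in every row, whereas `keep_top(M, 2, "matrix")` keeps just the two largest entries overall.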