Does VLM Classification Benefit from LLM Description Semantics?

Pingchuan Ma, Lennart Rietdorf, Dmytro Kotovenko, Vincent Tao Hu, Björn Ommer

TL;DR

This work asks whether LLM-generated image descriptions genuinely aid Vision-Language Model (VLM) classification or merely act as a noisy test-time ensemble. It proposes a classname-free evaluation and a training-free description-selection method that uses feedback from the VLM embedding space to identify highly distinctive, class-specific descriptions within a local neighborhood. By weighting the class-name prompt and restricting reliance on classname-free prompts, the approach improves accuracy across seven benchmarks; five descriptions per class often suffice, and accuracy exceeds the baselines on several datasets. The results indicate that carefully selected semantic enrichment can enhance VLM explainability and robustness, and they establish a framework for distinguishing semantic contributions from ensemble effects.

Abstract

Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect, where multiple modified text prompts act as a noisy test-time augmentation for the original one. We propose an alternative evaluation scenario to decide if a performance boost of LLM-generated descriptions is caused by such a noise augmentation effect or rather by genuine description semantics. The proposed scenario avoids noisy test-time augmentation and ensures that genuine, distinctive descriptions cause the performance boost. Furthermore, we propose a training-free method for selecting discriminative descriptions that work independently of classname-ensembling effects. Our approach identifies descriptions that effectively differentiate classes within a local CLIP label neighborhood, improving classification accuracy across seven datasets. Additionally, we provide insights into the explainability of description-based image classification with VLMs.
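
To make the distinction between the conventional and the classname-free setup concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a class score is a convex combination of the image's similarity to the class-name prompt and its mean similarity to the class descriptions, controlled by a weight `w_cls`; the exact weighting scheme is not given in this section, so that formula, the helper names (`cosine`, `class_score`, `classify`), and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def class_score(image_emb, classname_emb, description_embs, w_cls):
    """Assumed scoring rule: blend class-name and description similarity.

    w_cls = 1.0 uses only the class-name prompt;
    w_cls = 0.0 uses only the (classname-free) descriptions.
    """
    cls_sim = cosine(image_emb, classname_emb)
    desc_sim = sum(cosine(image_emb, d) for d in description_embs) / len(description_embs)
    return w_cls * cls_sim + (1.0 - w_cls) * desc_sim


def classify(image_emb, classes, w_cls):
    """Return the class name with the highest blended score."""
    return max(
        classes,
        key=lambda name: class_score(
            image_emb, classes[name]["classname"], classes[name]["descriptions"], w_cls
        ),
    )


# Toy embeddings (hypothetical): class "a" has a matching class-name prompt but
# an uninformative description; class "b" is the reverse.
image = [1.0, 0.0]
classes = {
    "a": {"classname": [1.0, 0.0], "descriptions": [[0.0, 1.0]]},
    "b": {"classname": [0.0, 1.0], "descriptions": [[1.0, 0.0]]},
}

print(classify(image, classes, w_cls=1.0))  # class-name prompt only
print(classify(image, classes, w_cls=0.0))  # descriptions only
```

In this toy case the prediction flips when `w_cls` goes from 1 to 0, which illustrates why removing the class name isolates what the descriptions themselves contribute.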

Paper Structure

This paper contains 37 sections, 7 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: Are the extra semantics provided by LLM truly useful? Our method first identifies candidate labels using only the class name. We then filter out descriptions that may seem logical but do not differentiate the group, e.g. ambiguous, overly generic, or noisy descriptions. This refinement ensures that the remaining descriptions provide distinctive vision-language cues within the local candidate neighborhood, offering more specificity than the class name alone can capture.
  • Figure 2: In the conventional setup (left), using CLIP with LLM-assigned class descriptions or even random strings can sometimes result in performance gains due to the added semantics or the smoothing ensemble effect. However, when the classname is removed, i.e. under the proposed classname-free setup (right), these descriptions will fail to perform well, as only meaningful descriptions w.r.t. the class are useful. In contrast, random strings or non-informative descriptions bring no gain.
  • Figure 3: Overall performance on all datasets in the classname-free setup. For descriptions assigned by our method and by an LLM, $w_{cls}$ assesses the influence of class labels on performance across the different datasets. For a detailed discussion, see the main results section.
  • Figure 4: Distinctiveness scores of randomly chosen images obtained by the training-free approach presented in the VLM feedback section. Distinctiveness scores $\overline{\mathrm{diff}}_{a,a'\in \mathcal{A}}^d = \frac{1}{k-1} \sum_{a'\in \mathcal{A}} \mathrm{diff}_{a,a'}^d$, where $\mathrm{diff}_{a,a'}^d=\bar{s}_{a,d}-\bar{s}_{a',d} \geq 0$. Parameters used: $k=3$, $m=5$, $n=\text{maximal}$, pool = DCLIP. See the appendix on distinctiveness scores for a concise discussion.
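
The distinctiveness score in the Figure 4 caption averages the pairwise differences $\bar{s}_{a,d}-\bar{s}_{a',d}$ over the $k-1$ other classes in the neighborhood $\mathcal{A}$. A minimal sketch of that arithmetic follows. It assumes that $\bar{s}_{a,d}$ is a precomputed mean similarity between class $a$ and description $d$ (its definition is not given in this section), and it treats the $\geq 0$ condition as a filter that rejects the description when any pairwise difference is negative, which is an interpretation rather than something the caption states. The function name, the data layout, and the numbers are hypothetical.

```python
import math


def distinctiveness(mean_sims, a, d):
    """Average of s_bar[a][d] - s_bar[a'][d] over the other classes a'.

    mean_sims maps class name -> {description -> mean similarity s_bar}.
    The neighborhood size k is len(mean_sims), so the divisor is k - 1.
    Returns None if any pairwise difference is negative (assumed filter).
    """
    others = [c for c in mean_sims if c != a]
    diffs = [mean_sims[a][d] - mean_sims[o][d] for o in others]
    if any(x < 0 for x in diffs):
        return None
    return sum(diffs) / len(others)


# Toy neighborhood with k = 3 classes and made-up similarity values.
mean_sims = {
    "sparrow": {"streaked brown back": 0.30},
    "finch": {"streaked brown back": 0.24},
    "wren": {"streaked brown back": 0.20},
}

score = distinctiveness(mean_sims, "sparrow", "streaked brown back")
print(score)  # approximately 0.08, i.e. (0.06 + 0.10) / 2
```

Under this reading, the same description scored for "wren" would be rejected, because it matches the other two classes better than it matches "wren".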