Does VLM Classification Benefit from LLM Description Semantics?
Pingchuan Ma, Lennart Rietdorf, Dmytro Kotovenko, Vincent Tao Hu, Björn Ommer
TL;DR
This work asks whether LLM-generated descriptions genuinely aid Vision-Language Model (VLM) classification or merely act as a noisy test-time ensemble. It proposes a classname-free evaluation and a training-free description-selection method that uses feedback from the VLM embedding space to identify highly distinctive, class-specific descriptions within a local neighborhood. By weighting the class-name prompt and limiting reliance on classname-free prompts, the approach improves accuracy across seven benchmarks; five descriptions per class often suffice, and it outperforms baselines on several datasets. The results indicate that carefully selected semantic enrichment can enhance VLM explainability and robustness, and they establish a framework for distinguishing semantic contributions from ensemble effects.
Abstract
Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect, where multiple modified text prompts act as a noisy test-time augmentation for the original one. We propose an alternative evaluation scenario to determine whether a performance boost from LLM-generated descriptions is caused by such a noise augmentation effect or by genuine description semantics. The proposed scenario avoids noisy test-time augmentation and ensures that genuine, distinctive descriptions cause the performance boost. Furthermore, we propose a training-free method for selecting discriminative descriptions that works independently of classname-ensembling effects. Our approach identifies descriptions that effectively differentiate classes within a local CLIP label neighborhood, improving classification accuracy across seven datasets. Additionally, we provide insights into the explainability of description-based image classification with VLMs.
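To make the classname-free setting concrete, the following is a minimal sketch of description-based classification in a shared embedding space. It is an illustration under stated assumptions, not the authors' implementation: the function names (l2_normalize, classify_with_descriptions) are hypothetical, the embeddings are toy vectors standing in for CLIP image and text features, and scoring a class by the mean cosine similarity over its description embeddings is one common choice rather than a detail confirmed by the abstract.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def classify_with_descriptions(image_emb, class_desc_embs):
    """Score each class by the mean cosine similarity between the image
    embedding and that class's description embeddings.

    image_emb:        array of shape (d,)
    class_desc_embs:  list of arrays, one per class, each of shape (n_i, d)

    Assumption: the descriptions contain no class name, so any accuracy
    comes from the description text alone (the classname-free setting).
    """
    image_emb = l2_normalize(image_emb)
    scores = [
        float(np.mean(l2_normalize(descs) @ image_emb))
        for descs in class_desc_embs
    ]
    return int(np.argmax(scores)), scores


# Toy stand-ins for CLIP features (hypothetical values, d = 4).
image = np.array([1.0, 0.0, 0.0, 0.0])
descriptions = [
    np.array([[1.0, 0.1, 0.0, 0.0], [0.9, 0.0, 0.2, 0.0]]),  # class 0
    np.array([[0.0, 1.0, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]),  # class 1
]
pred, scores = classify_with_descriptions(image, descriptions)
print(pred, scores)
```

In this setup, descriptions that carry no class-specific information would produce similar scores for every class, which is why a classname-free evaluation can help separate genuine description semantics from an ensembling effect.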
