Enhancing Visual Classification using Comparative Descriptors

Hankyeol Lee, Gawon Seo, Wonseok Choi, Geunyoung Jung, Kyungwoo Song, Jiyoung Jung

TL;DR

Comparative descriptors, which emphasize how a target class differs from its most similar classes, are generated and integrated into CLIP-based classification. They refine the semantic focus and improve accuracy and robustness on classes with subtle inter-class differences.

Abstract

The performance of vision-language models (VLMs) such as CLIP in visual classification tasks has been enhanced by leveraging semantic knowledge from large language models (LLMs), including GPT. Recent studies have shown that in zero-shot classification tasks, descriptors incorporating additional cues, high-level concepts, or even random characters often outperform those using only the category name. In many classification tasks, the top-1 accuracy may be relatively low while the top-5 accuracy is significantly higher. This gap implies that most misclassifications occur among a few similar classes, highlighting the model's difficulty in distinguishing between classes with subtle differences. To address this challenge, we introduce the novel concept of comparative descriptors. These descriptors emphasize the unique features of a target class against its most similar classes, enhancing differentiation. By generating these comparative descriptors and integrating them into the classification framework, we refine the semantic focus and improve classification accuracy. An additional filtering process ensures that these descriptors are closer to the image embeddings in the CLIP space, further enhancing performance. Our approach demonstrates improved accuracy and robustness in visual classification tasks by addressing the specific challenge of subtle inter-class differences.
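The top-1 versus top-5 gap mentioned in the abstract can be illustrated with a minimal sketch. The function name `top_k_accuracy` and the toy scores below are assumptions made for illustration; they are not taken from the paper, and k=2 stands in for top-5 because the toy problem has only three classes.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Made-up class scores for four images over three classes (not real CLIP outputs).
scores = [
    [0.40, 0.45, 0.15],  # true class 0 is ranked second
    [0.10, 0.70, 0.20],  # true class 1 is ranked first
    [0.35, 0.25, 0.40],  # true class 0 is ranked second
    [0.20, 0.30, 0.50],  # true class 2 is ranked first
]
labels = [0, 1, 0, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

In this toy case every top-1 error has the true class as its runner-up, which is the pattern the abstract points to: the confusion is concentrated among a few similar classes.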

Paper Structure

This paper contains 23 sections, 3 equations, 4 figures, 6 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of our method and the baselines (CLIP [Radford-2021-CLIP], GPT-descriptor-extended CLIP (DCLIP) [Menon-2023-DCLIP], and WaffleCLIP [Roth-2023-Waffle]). (left) Comparison of our method with the baselines. Our proposed method outperforms the baselines on CLIP's image classification task. (right) A detailed overview of our method. Our method generates descriptors through comparison with semantically similar classes. A filtering process is then applied to retain only the descriptors that are useful for classification, significantly increasing classification accuracy.
  • Figure 2: Addressing ambiguity in descriptor generation. We compare the descriptors generated by our method and by DCLIP [Menon-2023-DCLIP] on the (top) Describable Textures Dataset (DTD) [Cimpoi-2014-DTD] and (bottom) Flowers102 (Flowers) [Nilsback-2008-Flowers102] datasets. The DCLIP method failed to generate descriptors when the class name was ambiguous (e.g., the word "banded" refers to both a texture and a species of snake), and in some cases generated descriptors that were unrelated to the class. In contrast, our method not only avoided these failures but also produced descriptors richer in context. This difference leads to a significant improvement in classification accuracy.
  • Figure 3: Comparison of generated descriptors between similar classes. Descriptors were generated for similar classes in the Places dataset. Bold text indicates an attribute that appears in both classes, and highlighted text indicates a distinct attribute. DCLIP generates semantically equivalent descriptors for the target class and its similar class. In contrast, our method minimizes semantically analogous descriptors, adds distinct features, and provides more detailed explanations, which makes it easier to distinguish between similar classes.
  • Figure 4: Examples of explainability. (left) We show examples of decisions and their justifications produced by our model. (right) We present the descriptors of each target class and its similar class within the datasets (from the top: Places365 [Zhou-2018-Places365], Food101 [Bossard-2014-Food101], Flowers102 [Nilsback-2008-Flowers102]). The chart shows the similarity score between each descriptor and the image in the CLIP latent space. The higher the similarity score, the greater the influence of the descriptor on the decision-making process. The class whose descriptors have the highest average similarity becomes the model's prediction.
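
The decision rule described in the Figure 4 caption (score each descriptor against the image, average per class, predict the best class) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3-dimensional vectors stand in for CLIP embeddings, and the names `cosine`, `classify`, `class_a`, and `class_b` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_emb, descriptor_embs):
    """Return (predicted class, per-class mean descriptor similarity).

    descriptor_embs maps a class name to a list of descriptor embeddings.
    """
    class_scores = {
        cls: sum(cosine(image_emb, d) for d in embs) / len(embs)
        for cls, embs in descriptor_embs.items()
    }
    # The class with the highest average similarity is the prediction.
    return max(class_scores, key=class_scores.get), class_scores


# Toy embeddings, made up for illustration (assumption: not real CLIP outputs).
image = [1.0, 0.2, 0.0]
descriptors = {
    "class_a": [[0.9, 0.1, 0.0], [1.0, 0.3, 0.1]],
    "class_b": [[0.1, 1.0, 0.0], [0.5, 0.5, 0.5]],
}

prediction, class_scores = classify(image, descriptors)
print(prediction, class_scores)
```

Because each descriptor contributes its own similarity term, the per-descriptor scores can be inspected directly to see which descriptors drove a given decision, which is the kind of justification the caption refers to.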