Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

Yunqi Hong, Sohyun An, Andrew Bai, Neil Y. C. Lin, Cho-Jui Hsieh

TL;DR

Fine-grained image classification with Multimodal LLMs is hampered by subtle visual differences between subcategories and by the cost of label-intensive training. AutoSEP introduces an unsupervised, black-box framework that extracts discriminative image descriptions via a description-generation prompt and optimizes that prompt through instance-level description matching on unlabeled data. The method iteratively applies Reflect and Modify steps to refine the prompt, achieving consistent improvements across diverse fine-grained (FG) datasets and MLLMs with no labeled data. These results demonstrate that unlabeled data, coupled with robust prompt optimization and a text-based description layer, can significantly boost zero-shot FG performance in practical, resource-constrained settings.

Abstract

Although Multimodal Large Language Models (MLLMs) show promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories--details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image and thereby boosts classification accuracy. AutoSEP, an automatic self-enhancing prompt learning framework, iteratively improves the description prompt using unlabeled data, based on an instance-level classification scoring function. AutoSEP only requires black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP improves on average by 13 percent over standard zero-shot classification and by 5 percent over the best-performing baselines. Code is available at: https://github.com/yq-hong/AutoSEP
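
The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`instance_level_score`, `optimize_prompt`, and the `toy_*` stand-ins for black-box MLLM calls) is hypothetical, the images are represented by plain strings, and the scoring rule (whether a generated description can be matched back to its own image among a few distractors) is an assumed reading of "instance-level" scoring. The greedy keep-the-best update is likewise an assumption, not a detail stated in the abstract.

```python
import random
from typing import Callable, List, Sequence, Tuple

# All names below are hypothetical; none are taken from the AutoSEP codebase.
Describe = Callable[[str, str], str]         # (prompt, image) -> description
Match = Callable[[str, Sequence[str]], int]  # (description, candidates) -> chosen index
Propose = Callable[[str], List[str]]         # prompt -> rewritten candidate prompts


def instance_level_score(prompt: str, images: Sequence[str], describe: Describe,
                         match: Match, num_distractors: int = 2, seed: int = 0) -> float:
    """Fraction of unlabeled images whose description is matched back to that image."""
    rng = random.Random(seed)
    hits = 0
    for i, image in enumerate(images):
        description = describe(prompt, image)
        others = [img for j, img in enumerate(images) if j != i]
        candidates = rng.sample(others, min(num_distractors, len(others))) + [image]
        rng.shuffle(candidates)
        if candidates[match(description, candidates)] == image:
            hits += 1
    return hits / len(images)


def optimize_prompt(initial_prompt: str, images: Sequence[str], describe: Describe,
                    match: Match, propose: Propose, iterations: int = 3) -> Tuple[str, float]:
    """Greedy search: keep a rewritten prompt only if it scores strictly higher."""
    best_prompt = initial_prompt
    best_score = instance_level_score(best_prompt, images, describe, match)
    for _ in range(iterations):
        for candidate in propose(best_prompt):
            score = instance_level_score(candidate, images, describe, match)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Toy stand-ins for the black-box MLLM calls (assumptions for illustration only).
IMAGES = ["red beak, short tail", "red beak, long tail",
          "black beak, short tail", "black beak, long tail"]


def toy_describe(prompt: str, image: str) -> str:
    # A prompt that asks about beak and tail yields a discriminative description.
    return image if "beak and tail" in prompt else "a bird"


def toy_match(description: str, candidates: Sequence[str]) -> int:
    # Pick the candidate that equals the description; fall back to the first one.
    for idx, candidate in enumerate(candidates):
        if candidate == description:
            return idx
    return 0


def toy_propose(prompt: str) -> List[str]:
    # Stand-in for a reflect-and-modify step that rewrites the prompt.
    return [prompt + " Focus on beak and tail."]


best_prompt, best_score = optimize_prompt(
    "Describe the bird.", IMAGES, toy_describe, toy_match, toy_propose)
print(best_prompt, best_score)
```

No labels appear anywhere in the sketch: the only supervision signal is whether a description identifies its own source image, which is why unlabeled data suffices under these assumptions.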

Paper Structure

This paper contains 40 sections, 8 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Top: Typical zero-shot pipeline for MLLM classification. Middle: Illustration of AutoSEP pipeline. Bottom: An example of an AutoSEP-optimized, description-generation prompt.
  • Figure 2: An illustration of automatic self-enhancing prompt learning with instance-level classification.
  • Figure 3: Evolution of metrics with increasing optimization iterations.
  • Figure 4: Classification accuracy of Gemini with varying numbers of samples used for optimization.
  • Figure 5: Examples showcasing descriptions generated from two different prompts using Gemini 1.5 Flash. Attributes highlighted in green indicate correct information, while those in red denote incorrect attributes.
  • ...and 3 more figures