Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models
Ashhadul Islam, Md. Rafiul Biswas, Wajdi Zaghouani, Samir Brahim Belhaouari, Zubair Shah
TL;DR
This work investigates zero-shot image classification with large multimodal models, focusing on LLaVA-1.5, by applying dataset-specific prompts to four diverse datasets and evaluating performance without fine-tuning. It describes the LLaVA architecture, including CLIP-based vision encoding, a vision-language projection, and a LLaMA backbone, together with its two-stage training regime and the improvements introduced in LLaVA-1.5 (MLP connector, larger LLM, higher resolution). The study reports zero-shot accuracies of $85\%$, $100\%$, $77\%$, and $79\%$ on MNIST, Cats vs. Dogs, Hymenoptera (Ants vs. Bees), and Pox vs. Non-Pox skin images, respectively, and shows that fine-tuning on an autism-face dataset raises test accuracy from $55\%$ to $83\%$ with modest data and epochs, illustrating the potential for medical-imaging applications. Despite limitations such as single-image processing, context-length constraints, and hallucination risks, the work highlights the practical utility and reproducibility of multimodal instruction-following systems for classification and domain-specific tasks.
Abstract
The synergy of language and vision models has given rise to Large Language and Vision Assistant models (LLVAs), designed to engage users in rich conversational experiences intertwined with image-based queries. These multimodal models integrate vision encoders with Large Language Models (LLMs), expanding their applications in general-purpose language and visual comprehension. The advent of Large Multimodal Models (LMMs) heralds a new era in Artificial Intelligence (AI) assistance, extending the horizons of AI utilization. This paper takes a different perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts designed for specific datasets. We also investigate the zero-shot learning capabilities of LLVAs. Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images. In our experiments, the model achieves classification accuracies of 85\%, 100\%, 77\%, and 79\% on the respective datasets without any fine-tuning. To bolster our analysis, we also assess the model's performance after fine-tuning for specific tasks. In one instance, fine-tuning is conducted on a dataset comprising images of faces of children with and without autism. Prior to fine-tuning, the model had a test accuracy of 55\%, which improved to 83\% after fine-tuning. These results, coupled with our prior findings, underscore the transformative potential of LLVAs and their versatile applications in real-world scenarios.
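To make the evaluation idea concrete, the following is a minimal sketch of prompt-based zero-shot classification scoring. It is not the authors' code: the prompt wording, the helper names (`build_prompt`, `parse_label`, `zero_shot_accuracy`), and the `ask_model` callable are all hypothetical, and `fake_model` is a stand-in stub rather than a call to LLaVA. The sketch assumes the model returns free text, which is mapped to a class name by whole-word matching.

```python
import re
from typing import Callable, Iterable, Optional, Sequence, Tuple


def build_prompt(class_names: Sequence[str]) -> str:
    """Build a dataset-specific classification prompt (wording is an assumption)."""
    options = ", ".join(class_names)
    return (
        "Classify the image into exactly one of these categories: "
        f"{options}. Answer with the category name only."
    )


def parse_label(response: str, class_names: Sequence[str]) -> Optional[str]:
    """Map a free-text answer to a class name using whole-word matching.

    Returns None when zero or several class names appear, so that
    ambiguous answers are counted as errors instead of being guessed.
    """
    text = response.lower()
    matches = [
        name
        for name in class_names
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", text)
    ]
    return matches[0] if len(matches) == 1 else None


def zero_shot_accuracy(
    samples: Iterable[Tuple[object, str]],
    class_names: Sequence[str],
    ask_model: Callable[[object, str], str],
) -> float:
    """Fraction of samples whose parsed answer equals the true label."""
    prompt = build_prompt(class_names)
    correct = 0
    total = 0
    for image, true_label in samples:
        predicted = parse_label(ask_model(image, prompt), class_names)
        correct += int(predicted == true_label)
        total += 1
    return correct / total if total else 0.0


# Hypothetical stand-in for a multimodal model: returns canned answers.
def fake_model(image: object, prompt: str) -> str:
    canned = {"img1": "Cat", "img2": "This is a dog.", "img3": "cat"}
    return canned[str(image)]


demo_samples = [("img1", "cat"), ("img2", "dog"), ("img3", "dog")]
demo_accuracy = zero_shot_accuracy(demo_samples, ["cat", "dog"], fake_model)
print(f"accuracy = {demo_accuracy:.2f}")
```

In practice, `ask_model` would wrap whatever inference interface is used for the multimodal model. Whole-word matching is a deliberately simple design choice; class names that contain one another (for example "pox" and "non-pox") would need a stricter parsing rule.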
