Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models

Ashhadul Islam; Md. Rafiul Biswas; Wajdi Zaghouani; Samir Brahim Belhaouari; Zubair Shah

TL;DR

This work investigates zero-shot image classification with large multimodal models, focusing on LLaVA-1.5, by applying prompt engineering across four diverse datasets and evaluating performance without fine-tuning. It provides a detailed architectural blueprint of LLaVA, including CLIP-based vision encoding, a vision-language projection, and a LLaMA backbone, with a two-stage training regime and the subsequent improvements in LLaVA-1.5 (MLP connector, larger LLM, higher resolution). The study reports zero-shot accuracies of $85\%$ on MNIST, $100\%$ on Cats vs. Dogs, $77\%$ on Hymenoptera (Ants vs. Bees), and $79\%$ on Pox vs. Non-Pox, and shows a gain from $55\%$ to $83\%$ after fine-tuning on an autism-face dataset with modest data and epochs, illustrating the potential for medical-imaging applications. Despite limitations such as single-image processing, context-length constraints, and hallucination risks, the work highlights the practical utility and reproducibility of multimodal instruction-following systems for classification and domain-specific tasks.

Abstract

The synergy of language and vision models has given rise to Large Language and Vision Assistant models (LLVAs), designed to engage users in rich conversational experiences intertwined with image-based queries. These comprehensive multimodal models integrate vision encoders with Large Language Models (LLMs), expanding their applications in general-purpose language and visual comprehension. The advent of Large Multimodal Models (LMMs) heralds a new era in Artificial Intelligence (AI) assistance, extending the horizons of AI utilization. This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts designed for specific datasets. We also investigate the LLVAs' zero-shot learning capabilities. Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images. The results of our experiments demonstrate the model's remarkable performance, achieving classification accuracies of 85\%, 100\%, 77\%, and 79\% for the respective datasets without any fine-tuning. To bolster our analysis, we assess the model's performance after fine-tuning for specific tasks. In one instance, fine-tuning is conducted on a dataset comprising images of faces of children with and without autism. Prior to fine-tuning, the model achieved a test accuracy of 55\%, which improved to 83\% after fine-tuning. These results, coupled with our prior findings, underscore the transformative potential of LLVAs and their versatile applications in real-world scenarios.
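The prompt-based evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the prompt wording, the label-parsing rule, and the `query_model` callable (a stand-in for whatever function sends an image and a prompt to LLaVA-1.5 and returns its text reply) are all assumptions made here for illustration.

```python
import re
from typing import Callable, Iterable, Optional, Sequence, Tuple


def build_prompt(class_names: Sequence[str]) -> str:
    # Hypothetical prompt template; the paper's dataset-specific prompts may differ.
    options = ", ".join(class_names)
    return f"Classify the image. Answer with exactly one of: {options}."


def parse_label(response: str, class_names: Sequence[str]) -> Optional[str]:
    # Return the class name that appears earliest in the reply, or None if none appears.
    text = response.lower()
    hits = []
    for name in class_names:
        match = re.search(rf"\b{re.escape(name.lower())}\b", text)
        if match:
            hits.append((match.start(), name))
    return min(hits)[1] if hits else None


def zero_shot_accuracy(
    samples: Iterable[Tuple[str, str]],
    class_names: Sequence[str],
    query_model: Callable[[str, str], str],  # assumed interface: (image_path, prompt) -> reply
) -> float:
    # Fraction of samples whose parsed label equals the ground-truth label.
    prompt = build_prompt(class_names)
    correct = 0
    total = 0
    for image_path, true_label in samples:
        predicted = parse_label(query_model(image_path, prompt), class_names)
        correct += int(predicted == true_label)
        total += 1
    return correct / total if total else 0.0
```

Replies that name no known class are counted as errors in this sketch, which is one simple way to account for off-format or hallucinated answers; other scoring rules are equally plausible.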

Paper Structure (14 sections, 6 figures, 4 tables)

Introduction
Contributions
Large Language and Vision Assistant (LLaVA)
Components of LLaVA
LLaVA 1.5
LLaVA1.5 in action
Methodology
Memory Management
System specifications
Datasets Used
Dataset used for fine-tuning
Results
Results On Fine-Tuning
Conclusion

Figures (6)

Figure 1: Architecture of LLaVA
Figure 2: Overall methodology of the experiment. The class label was obtained using a combination of individual test images and a customised prompt.
Figure 3: Images in the MNIST dataset [deng2012mnist]
Figure 4: Images in the CatsVDogs [dogs-vs-cats], AntsVbees [Melody], and PoxVNoPox [ahsan2022monkeypox] datasets, respectively
Figure 5: Images of faces of children with or without autism [Autistic]
...and 1 more figure
