Exploring Fine-grained Retail Product Discrimination with Zero-shot Object Classification Using Vision-Language Models

Anil Osman Tur; Alessandro Conti; Cigdem Beyan; Davide Boscaini; Roberto Larcher; Stefano Messelodi; Fabio Poiesi; Elisa Ricci

TL;DR

This work tackles fine-grained zero-shot retail product discrimination using vision-language models. It introduces the MIMEX dataset of 28 product categories captured in realistic smart-retail scenarios and shows that vanilla zero-shot performance of state-of-the-art VLMs is insufficient for fine-grained distinctions. To address this, the authors propose an ensemble framework that combines textual prompts, BLIP2-generated captions, and visual embeddings from CLIP and DINOv2 with PCA, complemented by a nearest-prototype classifier and few-shot visual adaptation. Their results indicate that textual descriptions deliver strong open-set performance, while visual prototypes and multimodal ensembles yield robust gains, with notable improvements over baselines; the dataset and benchmark are released to accelerate research in zero-shot retail classification.
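The score-level ensemble idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the softmax temperature, and the uniform default weights are assumptions made for illustration, and the embeddings would in practice come from models such as CLIP (text prompts, visual prototypes) or BLIP2 captions encoded as text.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def class_scores(queries, class_embeddings, temperature=0.01):
    """Cosine similarity of each query to each class embedding, as probabilities.

    `temperature` is an assumed hyperparameter, not a value from the paper.
    """
    sims = l2_normalize(queries) @ l2_normalize(class_embeddings).T
    return softmax(sims / temperature)


def ensemble_predict(score_list, weights=None):
    """Fuse per-source class probabilities with a weighted average.

    score_list: list of arrays, each of shape (n_queries, n_classes),
    e.g. one from text prompts, one from captions, one from visual prototypes.
    """
    scores = np.stack(score_list)  # (n_sources, n_queries, n_classes)
    if weights is None:
        weights = np.full(len(score_list), 1.0 / len(score_list))
    weights = np.asarray(weights, dtype=float)
    fused = np.tensordot(weights, scores, axes=1)  # (n_queries, n_classes)
    return fused.argmax(axis=1), fused
```

Averaging probabilities rather than raw similarities keeps the sources on a comparable scale; whether uniform weights are appropriate would need to be checked on held-out data.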

Abstract

In smart retail applications, the large number of products and their frequent turnover necessitate reliable zero-shot object classification methods. The zero-shot assumption is essential to avoid the need for re-training the classifier every time a new product is introduced into stock or an existing product undergoes rebranding. In this paper, we make three key contributions. Firstly, we introduce the MIMEX dataset, comprising 28 distinct product categories. Unlike existing datasets in the literature, MIMEX focuses on fine-grained product classification and includes a diverse range of retail products. Secondly, we benchmark the zero-shot object classification performance of state-of-the-art vision-language models (VLMs) on the proposed MIMEX dataset. Our experiments reveal that these models achieve unsatisfactory fine-grained classification performance, highlighting the need for specialized approaches. Lastly, we propose a novel ensemble approach that integrates embeddings from CLIP and DINOv2 with dimensionality reduction techniques to enhance classification performance. By combining these components, our ensemble approach outperforms VLMs, effectively capturing visual cues crucial for fine-grained product discrimination. Additionally, we introduce a class adaptation method that utilizes visual prototyping with limited samples in scenarios with scarce labeled data, addressing a critical need in retail environments where product variety frequently changes. To encourage further research into zero-shot object classification for smart retail applications, we will release both the MIMEX dataset and benchmark to the research community. Interested researchers can contact the authors for details on the terms and conditions of use. The code is available at https://github.com/AnilOsmanTur/Zero-shot-Retail-Product-Classification.
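As a rough illustration of visual prototyping with limited samples, the sketch below fuses two embedding spaces, reduces them with PCA, and classifies queries by their nearest class prototype. It is a sketch under stated assumptions, not the released code: the random vectors are hypothetical stand-ins for CLIP and DINOv2 features, and the per-backbone normalization, concatenation, component count, and cosine-similarity rule are illustrative choices.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fuse(feats_a, feats_b):
    """Normalize each backbone's embedding, then concatenate (assumed fusion)."""
    return np.concatenate([l2_normalize(feats_a), l2_normalize(feats_b)], axis=1)


def fit_pca(x, n_components):
    """Return the training mean and the top principal directions via SVD."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(x, mean, components):
    """Center with the training mean, project, and re-normalize."""
    return l2_normalize((x - mean) @ components.T)


def build_prototypes(features, labels, n_classes):
    """One prototype per class: the normalized mean of its support embeddings."""
    means = [features[labels == c].mean(axis=0) for c in range(n_classes)]
    return l2_normalize(np.stack(means))


def predict(features, prototypes):
    """Assign each query to the prototype with the highest cosine similarity."""
    return (features @ prototypes.T).argmax(axis=1)


def synthetic_split(rng, centers, per_class, noise=0.5):
    """Hypothetical stand-in for backbone features: class center plus noise."""
    labels = np.repeat(np.arange(len(centers)), per_class)
    noise_term = noise * rng.normal(size=(labels.size, centers.shape[1]))
    return centers[labels] + noise_term, labels


def run_demo(seed=0, n_classes=5, shots=4, n_test=20, n_components=8):
    """Few-shot prototype classification on synthetic data; returns accuracy."""
    rng = np.random.default_rng(seed)
    centers_a = 3.0 * rng.normal(size=(n_classes, 32))  # placeholder "CLIP" space
    centers_b = 3.0 * rng.normal(size=(n_classes, 48))  # placeholder "DINOv2" space

    train_a, train_y = synthetic_split(rng, centers_a, shots)
    train_b, _ = synthetic_split(rng, centers_b, shots)
    test_a, test_y = synthetic_split(rng, centers_a, n_test)
    test_b, _ = synthetic_split(rng, centers_b, n_test)

    train = fuse(train_a, train_b)
    mean, components = fit_pca(train, n_components)  # PCA fitted on support set only
    prototypes = build_prototypes(project(train, mean, components), train_y, n_classes)
    pred = predict(project(fuse(test_a, test_b), mean, components), prototypes)
    return float((pred == test_y).mean())
```

Because adding a new product only requires computing one more prototype from a handful of images, this kind of classifier avoids retraining; the PCA basis, however, is fitted on the support set here and might need refitting as the catalogue changes.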

Paper Structure (8 sections, 1 equation, 4 figures, 5 tables)

This paper contains 8 sections, 1 equation, 4 figures, 5 tables.

Introduction
Related Work
Dataset
Method
Preliminaries
Our approach
Experiments
Discussions and Conclusions

Figures (4)

Figure 1: Example images from the MIMEX dataset showcasing the cropped patches with various orientations and occlusions. The dataset includes visually similar products, such as pasta sauces, potato chip cans, and chocolates with similar packaging, highlighting the challenge of fine-grained classification in retail environments.
Figure 2: Distribution of image samples across product categories in the MIMEX dataset, with colors indicating the allocation to the test (orange) and train (blue) splits.
Figure 3: (a) Prototype generation: This panel illustrates the process of creating prototypes by extracting and refining visual embeddings from image data. (b) Integration of CLIP and DINOv2 features to create visual prototypes: This panel shows how visual embeddings from CLIP and DINOv2 are combined and the use of PCA to reduce dimensionality and enhance the clarity of the prototypes, facilitating improved understanding within contexts.
Figure 4: (a) UMAP visualization of CLIP's feature space on the MIMEX dataset, with distinct class representations in different colors and class centroids marked by stars. (b) Visualization of class centroids based on prompt predictions, emphasizing the central points of each class.
