RetailKLIP : Finetuning OpenCLIP backbone using metric learning on a single GPU for Zero-shot retail product image classification

Muktabh Mayank Srivastava

TL;DR

RetailKLIP finetunes the vision encoder of CLIP on a large retail dataset to produce embeddings suitable for zero-shot, nearest-neighbor classification, addressing the need for rapid updates when new products launch. By combining ArcFace metric learning, blockwise learning rate decay, and data-balanced training on RP6K, the method achieves competitive zero-shot accuracy on CAPG-GP, Grozi-120, and RP2K while operating on a single GPU. The approach reduces retraining costs and latency for adding new products, enabling scalable deployment in retail environments. Overall, RetailKLIP demonstrates that targeted finetuning of a large vision model can rival full finetuning in practical, open-set retail recognition tasks with explicit resource constraints.
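
As a rough illustration of the ArcFace metric-learning component named above, the following sketch computes additive-angular-margin logits in NumPy. The function name `arcface_logits`, the scale of 30, and the margin of 0.5 are illustrative assumptions rather than values reported in the source, and the sketch omits the special handling that ArcFace implementations typically add for angles close to π.

```python
import numpy as np

def arcface_logits(embeddings, class_weights, labels, scale=30.0, margin=0.5):
    """Return scaled cosine logits with an angular margin added to the true class.

    scale and margin are assumed hyperparameters, not taken from the paper.
    """
    # L2-normalize embeddings and class centers so dot products are cosines.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)
    theta = np.arccos(cos)
    # Add the margin only to the angle between each sample and its own class.
    rows = np.arange(len(labels))
    theta[rows, labels] += margin
    return scale * np.cos(theta)

# Toy example: one 2-D embedding aligned with class 0.
logits = arcface_logits(
    np.array([[1.0, 0.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([0]),
)
```

These logits would then be passed to a standard softmax cross-entropy loss. The margin makes the true class harder to satisfy during training, which tends to pull embeddings of the same product closer together.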

Abstract

Images of retail products or packaged grocery goods need to be classified in various computer vision applications, such as self-checkout stores, supply chain automation, and retail execution evaluation. Previous works explore ways to finetune deep models for this purpose. However, because finetuning a large model, or even a linear layer on top of a pretrained backbone, requires running at least a few epochs of gradient descent for every new retail product added to the classification range, frequent retraining is needed in a real-world scenario. In this work, we propose finetuning the vision encoder of a CLIP model so that its embeddings can be easily used for nearest-neighbor classification, while also achieving accuracy close to or exceeding that of full finetuning. A nearest-neighbor classifier needs no incremental training for new products, thus saving resources and wait time.
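
The point about avoiding incremental training can be made concrete with a minimal sketch of a nearest-neighbor classifier over precomputed embeddings. The class name `NearestNeighborIndex`, the product labels, and the 3-dimensional toy vectors are hypothetical stand-ins for real encoder outputs, and cosine similarity is an assumed choice of distance; the source only states that nearest-neighbor classification is used.

```python
import numpy as np

class NearestNeighborIndex:
    """Cosine-similarity nearest-neighbor classifier (illustrative only)."""

    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.labels = []

    def add(self, embeddings, label):
        # Registering a product only appends rows; no gradient step is run.
        embeddings = np.asarray(embeddings, dtype=np.float32)
        embeddings = embeddings.reshape(-1, self.vectors.shape[1])
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.vectors = np.vstack([self.vectors, embeddings])
        self.labels.extend([label] * len(embeddings))

    def predict(self, query):
        # Return the label of the stored embedding most similar to the query.
        query = np.asarray(query, dtype=np.float32)
        query = query / np.linalg.norm(query)
        scores = self.vectors @ query
        return self.labels[int(np.argmax(scores))]

# Toy vectors stand in for embeddings from the finetuned vision encoder.
index = NearestNeighborIndex(dim=3)
index.add([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]], "cola_330ml")
index.add([[0.0, 1.0, 0.0]], "chips_50g")
# A newly launched product is added without any retraining.
index.add([[0.0, 0.0, 1.0]], "new_snack_bar")
prediction = index.predict([0.1, 0.05, 0.95])
```

In this sketch, adding a product costs one forward pass per reference image plus an array append, whereas a classifier head would need its output layer resized and retrained.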

Paper Structure (12 sections, 1 figure, 2 tables)

This paper contains 12 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: RetailKLIP is trained on RP6K. It is then evaluated for zero-shot classification on Grozi-120, CAPG-GP, and RP2K.