A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification

Bianca Lamm, Janis Keuper

TL;DR

This paper tackles fine-grained product classification in fast-changing retail settings by introducing a Visual RAG pipeline that combines Retrieval Augmented Generation with Vision-Language Models to perform few-shot FGC. The method builds a task-specific external knowledge base (vector store) and uses contextual few-shot samples to guide VLMs in extracting product and promotion data, including GTINs and pricing details, without retraining. Empirical results show the Visual RAG approach achieving 86.8% GTIN-based accuracy, outperforming image-only, text-only, and zero-shot multimodal baselines, with comprehensive ablations on VLMs and context. The work demonstrates practical benefits for price monitoring and product recommendations in dynamic retail environments, while analyzing biases, costs, and limitations of segmentation quality and external model dependencies.

Abstract

Despite the rapid evolution of learning and computer vision algorithms, Fine-Grained Classification (FGC) remains an open problem in many practically relevant applications. In the retail domain, for example, identifying fast-changing, visually highly similar products and their properties is key to automated price monitoring and product recommendation. This paper presents a novel Visual RAG pipeline that combines the Retrieval Augmented Generation (RAG) approach and Vision Language Models (VLMs) for few-shot FGC. The pipeline extracts product and promotion data from the advertisement leaflets of various retailers and simultaneously predicts fine-grained product IDs along with price and discount information. Compared to previous approaches, the key characteristic of the Visual RAG pipeline is that it allows the prediction of novel products without re-training, simply by adding a few class samples to the RAG database. In a comparison of several VLM back-ends, including GPT-4o [23], GPT-4o-mini [24], and Gemini 2.0 Flash [10], our approach achieves 86.8% accuracy on a diverse dataset.
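
The claim that novel products can be predicted "simply by adding a few class samples to the RAG database" can be illustrated with a toy retrieval step. The following is a minimal sketch, not the authors' implementation: the `FewShotStore` class, the hand-written three-dimensional embeddings, and the placeholder labels `gtin-a`, `gtin-b`, `gtin-c` are all hypothetical, and plain cosine similarity stands in for whatever embedding model and vector store the pipeline actually uses. It covers only the retrieval part; in the described pipeline, the retrieved samples would additionally be passed to a VLM as context.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FewShotStore:
    """Hypothetical in-memory vector store mapping embeddings to product labels."""

    def __init__(self):
        self.items = []  # list of (embedding, label) pairs

    def add(self, embedding, label):
        # Adding a class sample is a plain append; no model is retrained.
        self.items.append((list(embedding), label))

    def nearest(self, query, k=1):
        # Rank stored samples by similarity to the query, most similar first.
        ranked = sorted(self.items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [label for _, label in ranked[:k]]


store = FewShotStore()
store.add([1.0, 0.0, 0.0], "gtin-a")  # placeholder labels, not real GTINs
store.add([0.0, 1.0, 0.0], "gtin-b")

query = [0.1, 0.2, 0.9]  # stand-in embedding of an unseen product image

before = store.nearest(query)  # best match among the known classes only
store.add([0.0, 0.0, 1.0], "gtin-c")  # register one sample of the novel product
after = store.nearest(query)  # the novel class is now retrievable

print(before, after)
```

The point of the sketch is the design choice rather than the similarity measure: class knowledge lives in the store, so supporting a new product is a data update instead of a training run.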

Paper Structure

This paper contains 32 sections, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Illustration of the presented Visual RAG pipeline. The pipeline is based on the RAG approach and is characterized by five main steps: Preprocessing; Vector Store; Retrieval, Classification, Relational Query; Prompt Generation; and Completion. In addition, contextual knowledge comprising few-shot samples with their corresponding task solutions is appended to the prompt for the employed VLM. Predicting the target GTINs constitutes the FGC task; the additional predictions complement this objective.
  • Figure 2: Image from the dataset.
  • Figure 3: Product and promotion data. Missing target values are stored as NaN.
  • Figure 5: Example images that exhibit a fine-grained difference due to variations in product weight. The evaluations of ResNet50 [he2016deep] and BERT [kenton2019bert] show that these images are misclassified.
  • Figure 6: Example images that exhibit a fine-grained difference because the products are from different brands. The evaluation of ResNet50 [he2016deep] shows that these images are misclassified.
  • ...and 10 more figures