FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo Embeddings

Zhen Wang, Da Li, Yulin Su, Min Yang, Minghui Qiu, Walton Wang

TL;DR

FashionLOGO explores how to prompt MLLMs to generate appropriate text for product images, which can help visual models achieve better logo embeddings. It adopts a cross-attention transformer block that lets the visual embedding automatically learn supplementary knowledge from the textual embedding.

Abstract

Logo embedding models convert the product logos in images into vectors, enabling logo recognition and detection on e-commerce platforms. This supports the enforcement of intellectual property rights and improves product search. However, current methods treat logo embedding as a purely visual problem. A notable issue is that visual models capture features beyond the logo itself. Instead, we view this as a multimodal task, using text as auxiliary information to help the visual model understand the logo. The emerging Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in both visual and textual understanding. Inspired by this, we propose FashionLOGO, an approach that explores how to prompt MLLMs to generate appropriate text for product images, which can help visual models achieve better logo embeddings. We adopt a cross-attention transformer block that enables the visual embedding to automatically learn supplementary knowledge from the textual embedding. Our extensive experiments on real-world datasets show that FashionLOGO generates generic and robust logo embeddings, achieving state-of-the-art performance on all benchmarks.

Paper Structure (14 sections, 5 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: FashionLOGO framework overview. (a) shows the traditional pipeline for training a logo embedding model and (b) shows the training pipeline of FashionLOGO. We use the image and text encoders from CLIP to extract the image and text embeddings; the text inputs are generated offline by LLaVA. A cross-attention transformer then enhances the image embedding by learning from the textual embedding during training.
  • Figure 2: Qualitative analysis. The three rows show the queries, the Top-1 results from ViT, and the Top-1 results from FashionLOGO, respectively. Correct results are framed in green and incorrect ones in red.
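
The cross-attention step described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single attention head, the residual connection, the embedding size, and the random arrays standing in for CLIP image and text embeddings are all assumptions made for illustration. It only shows the general mechanism, where image tokens act as queries and text tokens supply the keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_tokens, txt_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens attend to text tokens.

    Hypothetical helper for illustration; shapes are
    img_tokens (n_img, d), txt_tokens (n_txt, d), weights (d, d).
    """
    q = img_tokens @ w_q                      # queries come from the image
    k = txt_tokens @ w_k                      # keys come from the text
    v = txt_tokens @ w_v                      # values come from the text
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each image token's weights sum to 1
    attended = weights @ v                    # text information per image token
    # Assumed residual connection: keep the visual signal, add text context.
    return img_tokens + attended, weights

# Random placeholders for encoder outputs (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
d = 8
img_tokens = rng.normal(size=(5, d))   # stand-in for image embeddings
txt_tokens = rng.normal(size=(3, d))   # stand-in for text embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

enhanced, weights = cross_attention(img_tokens, txt_tokens, w_q, w_k, w_v)
print(enhanced.shape, weights.shape)
```

The output keeps the shape of the image tokens, so in a setup like this the enhanced embedding could replace the original visual embedding in whatever comes next.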