Category-Oriented Representation Learning for Image to Multi-Modal Retrieval

Zida Cheng, Chen Ju, Shuai Xiao, Xu Chen, Zhonghua Zhai, Xiaoyi Zeng, Weilin Huang, Junchi Yan

TL;DR

A novel framework named organizing categories and learning by classification for retrieval (OCLEAR) is proposed for Image-to-Multi-Modal Retrieval, the task of retrieving rich multi-modal documents based on image queries. It achieves SOTA performance on public datasets and has been deployed in a real-world industrial e-commerce system, leading to significant business growth.

Abstract

The rise of multi-modal search requests from users has highlighted the importance of multi-modal retrieval (i.e., image-to-text or text-to-image retrieval), yet the more complex task of image-to-multi-modal retrieval, crucial for many industry applications, remains under-explored. To address this gap and promote further research, we introduce and define Image-to-Multi-Modal Retrieval (IMMR), the task of retrieving rich multi-modal (i.e., image and text) documents based on image queries. We focus on representation learning for IMMR and analyze three key challenges: 1) skewed data and noisy labels in real-world industrial data; 2) the information inequality between the image and text modalities of documents when learning representations; and 3) effective and efficient training in large-scale industrial contexts. To tackle these challenges, we propose a novel framework named organizing categories and learning by classification for retrieval (OCLEAR). It consists of three components: 1) a novel category-oriented data governance scheme coupled with a large-scale classification-based learning paradigm, which handles the skewed and noisy data from a data perspective; 2) a model architecture specially designed for multi-modal learning, in which the information inequality between the image and text modalities of documents is considered during modality fusion; and 3) a hybrid parallel training approach for large-scale training in industrial scenarios. The proposed framework achieves SOTA performance on public datasets and has been deployed in a real-world industrial e-commerce system, leading to significant business growth. Code will be made publicly available.
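
To make the "learning by classification" idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each merged category ID has one learnable center vector (a "proxy"), that training minimizes cross-entropy over scaled cosine similarities between an embedding and all proxies, and that retrieval ranks documents by cosine similarity to the query embedding. The function names, the `scale` value, and the toy vectors are all hypothetical, and inputs are assumed to be non-zero vectors.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def proxy_classification_loss(embedding, proxies, target_id, scale=16.0):
    """Cross-entropy over scaled cosine similarities to per-ID proxies.

    `proxies[i]` is the (hypothetical) center vector of category ID i.
    `scale` is an assumed temperature-like hyperparameter.
    """
    logits = [scale * cosine(embedding, p) for p in proxies]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]


def retrieve(query_embedding, doc_embeddings, k=1):
    """Return indices of the k documents most similar to the query."""
    scores = [cosine(query_embedding, d) for d in doc_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: two category IDs with 2-D proxies.
proxies = [[1.0, 0.0], [0.0, 1.0]]
embedding = [0.9, 0.1]
loss_correct = proxy_classification_loss(embedding, proxies, target_id=0)
loss_wrong = proxy_classification_loss(embedding, proxies, target_id=1)
top = retrieve([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1]], k=1)
```

In this toy setup the loss is lower when the embedding is assigned to the proxy it points toward, which is the signal a classification-based paradigm would use in place of sampled positive and negative pairs.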

Paper Structure

This paper contains 14 sections, 4 equations, 8 figures, 4 tables, 2 algorithms.

Figures (8)

  • Figure 1: The industrial system exposure log and the traditional training data collection paradigm. Red boxes mark positive samples; blue boxes mark negative samples.
  • Figure 2: OCLEAR's novel data governance strategy, which uses inherent document information, user behaviors, and unsupervised clustering to merge data of the same category into one ID.
  • Figure 3: Model architecture consisting of four parts: image/text encoders, modality fusion, a transformation module, and ID center proxies. On the document side, the concept-aware modality fusion contains one concept extraction network and one fusion network.
  • Figure 4: Hybrid parallel training, consisting of data parallelism, model parallelism, and an efficient implementation of the final classification layer by KNN comparison.
  • Figure 5: Statistical distribution of exposure categories.
  • ...and 3 more figures