Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval

Xiaowan Hu, Yiyi Chen, Yan Li, Minquan Wang, Haoqian Wang, Quan Chen, Han Li, Peng Jiang

TL;DR

The paper tackles livestreaming product retrieval (LPR) by introducing SGMN, a one-stage framework that fuses text-guided attention, spatiotemporal graph reasoning, and multi-modal hard example mining to address clutter, video-image domain gap, and fine-grained discrimination. It combines Global Representation Alignment with a Graph-based Cross-domain Interaction and a Selective Multi-modal Fusion module, optimized by triplet and cross-entropy losses to align video, image, and text representations across domains. Extensive experiments on LPR4M and MovingFashion demonstrate state-of-the-art performance, improved robustness across diverse real-world conditions, and significantly faster inference compared to two-stage approaches. The work highlights the value of leveraging ASR and product titles for salience, modeling temporal-spatial cross-domain relations, and focusing on hard negatives to achieve precise, scalable LPR in practical settings.
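The triplet objective with hard negatives mentioned above can be illustrated with a minimal sketch. It assumes a batch in which video i is paired with image i, cosine similarity as the score, and a margin of 0.2. The function names and the margin value are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hard_negative_triplet_loss(video_embs, image_embs, margin=0.2):
    """Mean triplet loss over a batch where video i matches image i.

    For each video anchor, the hardest negative is the non-matching image
    with the highest similarity. The batch must hold at least two pairs.
    `margin` is a hypothetical value chosen for illustration.
    """
    losses = []
    for i, video in enumerate(video_embs):
        pos = cosine(video, image_embs[i])
        hardest_neg = max(
            cosine(video, image)
            for j, image in enumerate(image_embs)
            if j != i
        )
        # Penalize the anchor when the hardest negative is within the margin.
        losses.append(max(0.0, margin + hardest_neg - pos))
    return sum(losses) / len(losses)
```

When every matching pair is already separated from its hardest negative by more than the margin, the loss is zero, so training signal comes only from confusable products.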

Abstract

With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) recognizing the intended products among distractor products present in the background; 2) video-image heterogeneity, in which the appearance of products showcased in live streams often deviates substantially from the standardized product images in stores; 3) the presence of numerous confusing products with subtle visual differences in the shop. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus on the intended products, emphasizing their salience over cluttered background products. Second, a long-range spatiotemporal graph network is designed to achieve both instance-level interaction and frame-level matching, resolving the misalignment caused by video-image heterogeneity. Third, we propose a multi-modal hard example mining strategy that helps the model distinguish highly similar products using fine-grained features across the video, image, and text domains. Through extensive quantitative and qualitative experiments, we demonstrate the superior performance of our proposed SGMN model, which surpasses state-of-the-art methods by a substantial margin. The code is available at https://github.com/Huxiaowan/SGMN.
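To make the idea of text-guided attention concrete, here is a minimal sketch under stated assumptions: a single text embedding (for example, from the ASR transcript) scores each visual region feature by dot product, and a softmax turns the scores into weights. This is generic dot-product attention. The function name, the `temperature` parameter, and the pooling step are illustrative and are not taken from the paper.

```python
import math


def text_guided_pool(text_emb, region_feats, temperature=1.0):
    """Weight visual region features by their relevance to a text embedding.

    Returns (weights, pooled), where `weights` sums to 1 and `pooled` is the
    attention-weighted average of the region features.
    """
    # Relevance score of each region: dot product with the text embedding.
    scores = [
        sum(t * r for t, r in zip(text_emb, feat)) / temperature
        for feat in region_feats
    ]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of region features, one output value per dimension.
    dim = len(region_feats[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, pooled
```

Regions that align with the spoken description receive larger weights, so background products contribute less to the pooled representation.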

Paper Structure

This paper contains 23 sections, 30 equations, 12 figures, and 10 tables.

Figures (12)

  • Figure 1: Representative examples of (a) cluttered and similar background products and (b) large appearance variations such as occlusion, motion, and illumination. (c) The intra-domain and inter-domain graphs between videos and images, which enhance spatiotemporal frame-level interaction.
  • Figure 2: The architecture of the proposed SGMN. The inputs are live video clips, text from video ASR, product images, and product titles. In the GRA module, paired image-video and ASR-title representations are independently encoded and weighted for global similarity. The GCI module (top right) constructs the video graph, image graph, and video-image graph for cross-domain spatiotemporal relation learning. The SMF module (bottom right) selects hard examples and fuses multi-modal features for discriminative mining. Only the GRA module is used for inference, while the GCI and SMF modules are applied only during training.
  • Figure 3: Graph construction of a cross-domain connection between video and image sequences within a batch.
  • Figure 4: Details of the cross-domain interaction module.
  • Figure 5: Ranking results of representative products in LPR.
  • ...and 7 more figures
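
The Figure 2 caption notes that only the GRA module is used at inference, with image-video and ASR-title representations weighted into a global similarity. The sketch below shows one way such a ranking step could look, assuming precomputed embeddings and a fixed mixing weight `alpha`. The weighting scheme, the function names, and the gallery layout are assumptions made for illustration, since the source does not specify how the weights are obtained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_products(video_emb, asr_emb, gallery, alpha=0.5):
    """Rank gallery products for one livestream clip.

    `gallery` is a list of (product_id, image_emb, title_emb) tuples.
    `alpha` is a hypothetical fixed weight between the visual and textual
    similarities; a learned weighting may be used in practice.
    """
    scored = []
    for product_id, image_emb, title_emb in gallery:
        visual = cosine(video_emb, image_emb)
        textual = cosine(asr_emb, title_emb)
        scored.append((alpha * visual + (1.0 - alpha) * textual, product_id))
    # Highest combined similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [product_id for _, product_id in scored]
```

Because the gallery embeddings can be computed once and reused, a single-pass ranking of this kind is consistent with the faster inference claimed for the one-stage design.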