Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
Xiaowan Hu, Yiyi Chen, Yan Li, Minquan Wang, Haoqian Wang, Quan Chen, Han Li, Peng Jiang
TL;DR
The paper tackles livestreaming product retrieval (LPR) by introducing SGMN, a one-stage framework that combines text-guided attention, spatiotemporal graph reasoning, and multi-modal hard example mining to address background clutter, the video-image domain gap, and fine-grained discrimination. It pairs Global Representation Alignment with a Graph-based Cross-domain Interaction module and a Selective Multi-modal Fusion module, and is optimized with triplet and cross-entropy losses to align video, image, and text representations across domains. Extensive experiments on LPR4M and MovingFashion demonstrate state-of-the-art performance, improved robustness across diverse real-world conditions, and significantly faster inference than two-stage approaches. The work highlights the value of leveraging ASR transcripts and product titles for salience, modeling spatiotemporal cross-domain relations, and focusing on hard negatives to achieve precise, scalable LPR in practical settings.
Abstract
With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task involves three primary challenges in real-world scenarios: 1) distinguishing the intended products from distractor products present in the background; 2) video-image heterogeneity, in that the appearance of products showcased in live streams often deviates substantially from the standardized product images in stores; 3) the presence of numerous confusable products in the shop that differ only in subtle visual details. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus on the intended products, emphasizing their salience over cluttered background products. Second, we design a long-range spatiotemporal graph network that achieves both instance-level interaction and frame-level matching, resolving the misalignment caused by video-image heterogeneity. Third, we propose a multi-modal hard example mining strategy that helps the model distinguish highly similar products using fine-grained features across the video, image, and text domains. Through extensive quantitative and qualitative experiments, we demonstrate the superior performance of our proposed SGMN model, which surpasses state-of-the-art methods by a substantial margin. The code is available at https://github.com/Huxiaowan/SGMN.
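To make the hard example mining idea concrete, the sketch below shows one common way to pair a triplet loss with hardest-negative selection between video and product-image embeddings. It is an illustration under stated assumptions and not the SGMN implementation: the abstract does not specify the mining rule, so batch-hard selection, cosine similarity, the margin value, and the names `cosine` and `hard_negative_triplet_loss` are all hypothetical choices made here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def hard_negative_triplet_loss(video_emb, image_emb, video_ids, image_ids, margin=0.2):
    """Batch-hard triplet loss (assumed variant, not taken from the paper).

    For each video anchor, take the least similar product image with the same
    product id (hardest positive) and the most similar product image with a
    different id (hardest negative), then apply a hinge with the given margin.
    """
    losses = []
    for v, vid in zip(video_emb, video_ids):
        pos = [cosine(v, p) for p, pid in zip(image_emb, image_ids) if pid == vid]
        neg = [cosine(v, p) for p, pid in zip(image_emb, image_ids) if pid != vid]
        if not pos or not neg:
            continue  # skip anchors that lack a positive or a negative in the batch
        hardest_pos = min(pos)
        hardest_neg = max(neg)
        losses.append(max(0.0, margin - hardest_pos + hardest_neg))
    # Average over the anchors that produced a valid triplet.
    return sum(losses) / len(losses) if losses else 0.0
```

Under these assumptions, a negative that is already well separated from the anchor contributes zero loss, while a look-alike product whose image embedding sits close to the video embedding is penalized, which is the behavior that fine-grained discrimination calls for.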
