SEMINAR: Search Enhanced Multi-modal Interest Network and Approximate Retrieval for Lifelong Sequential Recommendation

Kaiming Shen, Xichen Ding, Zixiang Zheng, Yuqi Gong, Qianqian Li, Zhongyi Liu, Guannan Zhang

TL;DR

SEMINAR addresses two problems in modeling extremely long lifelong user histories: insufficient learning of ID embeddings and misaligned multi-modal item features. It introduces a Pretraining Search Unit (PSU) that learns unified lifelong sequences of multi-modal query-item pairs, followed by a downstream two-stage retrieval (GSU/ESU) that reuses the PSU embeddings with modality-aware projections. To meet online latency requirements, SEMINAR uses a multi-modal product quantization strategy for approximate retrieval, significantly reducing attention computation while preserving recall. Experiments on real-world datasets show that SEMINAR outperforms strong baselines in lifelong modeling and retrieval recall, and ablations confirm the importance of multi-modal alignment and the pretraining tasks.
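
The approximate-retrieval idea summarized above can be illustrated with a minimal product quantization sketch. It is not the authors' implementation. It assumes that the relevance of a history item is a weighted sum of per-modality inner products with weights `gamma`, and that each modality gets its own codebooks trained with plain k-means. The function names, dimensions, weights, and random data are all hypothetical.

```python
import numpy as np

def train_codebooks(x, n_sub, n_codes, n_iter=10, seed=0):
    """Plain k-means in each subspace; returns (n_sub, n_codes, d_sub)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    d_sub = d // n_sub
    books = np.empty((n_sub, n_codes, d_sub))
    for s in range(n_sub):
        sub = x[:, s * d_sub:(s + 1) * d_sub]
        cent = sub[rng.choice(n, n_codes, replace=False)].copy()
        for _ in range(n_iter):
            dist = ((sub[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for c in range(n_codes):
                members = sub[assign == c]
                if len(members):          # keep old centroid if cluster is empty
                    cent[c] = members.mean(0)
        books[s] = cent
    return books

def encode(x, books):
    """Replace each sub-vector by the index of its nearest centroid."""
    n_sub, _, d_sub = books.shape
    codes = np.empty((x.shape[0], n_sub), dtype=np.int64)
    for s in range(n_sub):
        sub = x[:, s * d_sub:(s + 1) * d_sub]
        dist = ((sub[:, None, :] - books[s][None, :, :]) ** 2).sum(-1)
        codes[:, s] = dist.argmin(1)
    return codes

def approx_scores(q, codes, books):
    """Approximate <q, x_i> for all items using one lookup table per query."""
    n_sub, _, d_sub = books.shape
    # table[s, c] = inner product of the s-th query sub-vector with centroid c
    table = np.einsum("sd,scd->sc", q.reshape(n_sub, d_sub), books)
    return table[np.arange(n_sub)[None, :], codes].sum(1)

# Hypothetical setup: three modalities with assumed fusion weights.
rng = np.random.default_rng(0)
n_items, dim, k = 500, 16, 50
modalities = ["text", "image", "attr"]
gamma = {"text": 0.5, "image": 0.3, "attr": 0.2}
items = {m: rng.normal(size=(n_items, dim)) for m in modalities}
query = {m: rng.normal(size=dim) for m in modalities}

# Exact weighted score versus the codebook-based approximation.
exact = sum(gamma[m] * items[m] @ query[m] for m in modalities)
approx = np.zeros(n_items)
for m in modalities:
    books = train_codebooks(items[m], n_sub=4, n_codes=16)
    codes = encode(items[m], books)
    approx += gamma[m] * approx_scores(query[m], codes, books)

# Recall@K: overlap between the approximate and exact top-K item sets.
top_exact = set(np.argsort(-exact)[:k].tolist())
top_approx = set(np.argsort(-approx)[:k].tolist())
recall = len(top_exact & top_approx) / k
print(f"Recall@{k} = {recall:.2f}")
```

The saving comes from `approx_scores`: once the lookup table is built for a query, scoring an item costs `n_sub` table reads and additions per modality instead of a full inner product over the embedding dimension.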

Abstract

Modeling user behavior is crucial in modern recommendation systems. Much research focuses on modeling users' lifelong sequences, which can be extremely long and sometimes exceed thousands of items. These models use the target item to search for the most relevant items in the historical sequence. However, training on lifelong sequences for click-through rate (CTR) prediction or personalized search ranking (PSR) is extremely difficult because ID embeddings are insufficiently learned, especially when the IDs in the lifelong sequence features do not appear in the samples of the training dataset. Additionally, existing target attention mechanisms struggle to learn good multi-modal representations of the items in the sequence. The distributions of the multi-modal embeddings (text, image and attributes) of a user's interacted items are not properly aligned, and there is divergence across modalities. We also observe that users' search query sequences and item browsing sequences can fully depict users' intents and benefit from each other. To address these challenges, we propose a unified lifelong multi-modal sequence model called SEMINAR (Search Enhanced Multi-Modal Interest Network and Approximate Retrieval). Specifically, a network called the Pretraining Search Unit (PSU) learns the lifelong sequences of multi-modal query-item pairs in a pretraining-finetuning manner with multiple objectives: multi-modal alignment, next query-item pair prediction, query-item relevance prediction, etc. After pretraining, the downstream model restores the pretrained embeddings as initialization and finetunes the network. To accelerate the online retrieval of multi-modal embeddings, we propose a multi-modal codebook-based product quantization strategy to approximate the exact attention calculati

Paper Structure (23 sections, 12 equations, 4 figures, 4 tables)

This paper contains 23 sections, 12 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of the SEMINAR model architecture. $S_{i}$ denotes the i-th behavior (a query-item pair) in the lifelong sequence. Each behavior has multiple channels: the query and the multi-modal features of text, image and attributes. PSU denotes the Pretraining Search Unit. GSU and ESU denote the General Search Unit and the Exact Search Unit respectively, which form the two-stage paradigm.
  • Figure 2: $\text{Recall@K}$ evaluation of different approximate fast retrieval methods on a synthetic dataset of multi-modal lifelong sequences. Plots in the first row show the group with the same norm $|x^{(m)}|$ and the same weight $|\gamma_{m}|$, plots in the second row show the group with different norms $|x^{(m)}|$ and the same weight $|\gamma_{m}|$, and plots in the third row show the group with the same norm $|x^{(m)}|$ and different weights $|\gamma_{m}|$.
  • Figure 3: Influence of the number of pretraining epochs on $\text{NDCG@K}$ performance on the KuaiSAR dataset
  • Figure 4: Influence of the query and item representation fusion weight $\lambda$ on $\text{NDCG@K}$ performance on the KuaiSAR dataset
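
Figures 3 and 4 report $\text{NDCG@K}$. As a reminder of what that metric measures, here is a minimal sketch of the standard definition with a log2 position discount. The evaluation protocol actually used in the paper is not given in this summary, so the function names and the input format (a list of relevance labels in ranked order) are assumptions.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (rank starts at 1)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list: 1 marks the item the user actually interacted with.
ranked_relevance = [0, 0, 1, 0, 0]
print(ndcg_at_k(ranked_relevance, k=5))
```

A relevant item at the first position gives a score of 1, and the score decays logarithmically as the item moves down the ranking.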