Leveraging Scene Context with Dual Networks for Sequential User Behavior Modeling

Xu Chen, Yunmeng Shu, Yuangang Pan, Jinsong Lan, Xiaoyong Zhu, Shuai Xiao, Haojin Zhu, Ivor W. Tsang, Bo Zheng

TL;DR

This work addresses sequential user behavior modeling in the presence of scene context, an often neglected but impactful contextual factor in apps. It introduces DSPnet, a dual sequence prediction network with parallel scene and item encoders, a sequence feature enhancement module, and a Conditional Contrastive Regularization (CCR) loss that makes representation learning more robust. The framework is theoretically grounded in a joint log-likelihood (ELBO) perspective and is evaluated on one public benchmark and two industrial datasets. After online deployment it brought a 0.04 point increase in CTR, 0.78% growth in deals, and a 0.64% rise in GMV. The results highlight the practical value of incorporating scene context and cross-sequence interactions for production sequential prediction in information retrieval systems.

Abstract

Modeling sequential user behaviors to predict future behavior is crucial for improving users' information retrieval experience. Recent studies highlight the importance of incorporating contextual information to enhance prediction performance. One crucial but often neglected piece of contextual information is the scene feature, which we define as a sub-interface within an app, created by developers to provide a specific functionality, such as the "text2product search" and "live" modules in e-commerce apps. Different scenes exhibit distinct functionalities and usage habits, leading to a significant distribution gap in user engagement across them. Popular sequential behavior models either ignore the scene feature or merely use it as an attribute embedding, which cannot effectively capture the dynamic interests and interplay between scenes and items when modeling user sequences. In this work, we propose a novel Dual Sequence Prediction network (DSPnet) to effectively capture the dynamic interests and interplay between scenes and items for future behavior prediction. DSPnet consists of two parallel networks dedicated to learning users' dynamic interests over items and scenes, and a sequence feature enhancement module that captures their interplay for enhanced future behavior prediction. Further, we introduce a Conditional Contrastive Regularization (CCR) loss to capture the invariance of similar historical sequences. Theoretical analysis suggests that DSPnet is a principled way to learn the joint relationships between scene and item sequences. Extensive experiments are conducted on one public benchmark and two collected industrial datasets. The method has been deployed online in our system, bringing a 0.04 point increase in CTR, 0.78% growth in deals, and a 0.64% rise in GMV. The code is available at this anonymous GitHub repository: https://anonymous.4open.science/r/DSPNet-ForPublish-2506/.
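
The dual-network idea in the abstract can be illustrated with a rough sketch. This is not the authors' implementation. Mean-pooling stands in for the actual sequence encoders, plain concatenation stands in for the sequence feature enhancement module, and every name, size, and weight below (`encode`, `dual_predict`, `NUM_ITEMS`, `W_item`, and so on) is a hypothetical choice made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes and embedding width (not from the paper).
NUM_ITEMS, NUM_SCENES, DIM = 50, 5, 8

# Randomly initialised embedding tables and output projections (assumptions).
item_emb = rng.normal(size=(NUM_ITEMS, DIM))
scene_emb = rng.normal(size=(NUM_SCENES, DIM))
W_item = rng.normal(size=(2 * DIM, NUM_ITEMS))
W_scene = rng.normal(size=(2 * DIM, NUM_SCENES))


def encode(seq, table):
    """Stand-in sequence encoder: mean-pool the embeddings of a sequence."""
    return table[np.asarray(seq)].mean(axis=0)


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.exp(x - x.max())
    return z / z.sum()


def dual_predict(item_seq, scene_seq):
    """Return next-item and next-scene probability vectors."""
    h_item = encode(item_seq, item_emb)      # item-side representation
    h_scene = encode(scene_seq, scene_emb)   # scene-side representation
    # Assumed enhancement step: each prediction head sees both representations.
    h = np.concatenate([h_item, h_scene])
    return softmax(h @ W_item), softmax(h @ W_scene)


# One user's history: clicked item ids and the scene id of each click.
p_item, p_scene = dual_predict([3, 17, 42], [0, 0, 2])
print(p_item.shape, p_scene.shape)
```

The point of the sketch is only the data flow: two parallel encoders, a step that lets each side see the other, and two prediction heads (next item and next scene). The CCR loss and the training objective are left out.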

Paper Structure

This paper contains 23 sections, 1 theorem, 16 equations, 21 figures, 8 tables.

Key Result

Lemma 1

Without specifying the sequential encoder architecture and prediction objective function, minimizing the loss of the dual sequence learning scheme is equivalent to maximizing an evidence lower bound of the joint log-likelihoods of observed item and scene sequential behaviors, where $\boldsymbol{v}$ and $\boldsymbol{s}$ denote the observed item and scene, respectively. $\mathcal{V}$ and $\mathcal{S}$ ...

Figures (21)

  • Figure 2: The architecture of DSPnet. Its dual sequence learning models the interplay between scene and item sequences while capturing sequence dynamics to counter the user-intention misalignment issue in scene-item data. The CCR loss learns representation invariance, applying different forces to different samples. $\oplus$ denotes concatenation.
  • Figure (unnumbered), panel (a): CTR under Query Category
  • Figure (unnumbered), panel (a): ContraRec
  • ...and 18 more figures

Theorems & Definitions (1)

  • Lemma 1