COURIER: Contrastive User Intention Reconstruction for Large-Scale Visual Recommendation

Jia-Qi Yang, Chenglei Dai, Dan OU, Dongshuai Li, Ju Huang, De-Chuan Zhan, Xiaoyi Zeng, Yang Yang

TL;DR

COURIER introduces a contrastive, user-intention reconstruction framework that learns visual item representations by reconstructing a user’s next clicked item from their history using a cross-attention mechanism. The method creates a many-to-one correspondence between history images and current items and employs a specialized contrastive objective to prevent embedding collapse, yielding improvements in offline AUC and online GMV on Taobao. Empirical results show strong gains on public datasets and robust improvements in the Taobao production system, with practical deployment details addressing large-scale training and serving. The approach demonstrates that user-behavior–driven visual features can more effectively capture downstream CTR signals than standard cross-modal pre-training approaches.

Abstract

With the advancement of multimedia internet, the impact of visual characteristics on users' decisions to click or not within the online retail industry is increasingly significant. Thus, incorporating visual features is a promising direction for further performance improvements in click-through rate (CTR) prediction. However, experiments on our production system revealed that simply injecting image embeddings trained with established pre-training methods yields only marginal improvements. We believe that the main advantage of existing image feature pre-training methods lies in their effectiveness for cross-modal predictions. However, this differs significantly from the task of CTR prediction in recommendation systems. In recommendation systems, other modalities of information (such as text) can be directly used as features in downstream models. Even if the performance of cross-modal prediction tasks is excellent, it is challenging to provide significant information gain for the downstream models. We argue that a visual feature pre-training method tailored for recommendation is necessary for further improvements beyond existing modality features. To this end, we propose an effective user intention reconstruction module to mine visual features related to user interests from behavior histories, which constructs a many-to-one correspondence. We further propose a contrastive training method to learn the user intentions and prevent the collapse of embedding vectors. We conduct extensive experimental evaluations on public datasets and our production system to verify that our method can learn users' visual interests. Our method achieves a $0.46\%$ improvement in offline AUC and a $0.88\%$ improvement in Taobao GMV (Gross Merchandise Volume) with p-value $<0.01$.

Paper Structure (33 sections, 11 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: (a) Existing image feature learning methods are tailored for cross-modal prediction tasks. (b) We propose a user intention reconstruction method to mine potential visual features that cannot be reflected by cross-modal labels. In this example, the user searched for "Coat" and received two recommendations (page-viewed, or PV, items). The user clicked on the one on the right. Our user intention reconstruction assigns larger attention to similar items in the user's click history; the reconstructed PV item embeddings are denoted as $R_{pv}^j$. We then optimize the PV embeddings $E_{pv}^j$ and reconstructions $R_{pv}^j$ to be closer if the corresponding item is clicked and farther apart otherwise.
  • Figure 2: The contrastive user intention reconstruction method. The images are fed into the image backbone model to obtain the corresponding embeddings. The embeddings of PV (Page-View) sequences are blue-colored, and the embeddings of click sequences are yellow-colored. The reconstructions are in green. Red boxes denote positive PV items.
  • Figure 3: The impact of different values of temperature $\tau$ on the performance of downstream CTR tasks. The horizontal axis represents the values of $\tau$, while the vertical axis denotes the change (%) in the metrics.
  • Figure 4: The AUC improvements of COURIER compared to the Baseline on different categories. The x-axis is sorted by the improvements.
  • Figure 5: T-SNE visualization of embeddings in different categories.
  • ...and 2 more figures
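
The mechanism described in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`reconstruct`, `contrastive_loss`), the scaled dot-product attention without learned projections, and the InfoNCE-style loss over the PV items of a single page are all assumptions chosen for brevity. The sketch reconstructs each PV embedding $E_{pv}^j$ as an attention-weighted sum of click-history embeddings to obtain $R_{pv}^j$, then uses a temperature $\tau$ to pull $E_{pv}^j$ and $R_{pv}^j$ together for the clicked item and apart for the others.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Cosine similarity; the small epsilon guards against zero vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct(pv_emb, click_embs):
    """Cross-attention with the PV embedding as query and the click-history
    embeddings as keys and values (assumption: no learned projections)."""
    scale = math.sqrt(len(pv_emb))
    weights = softmax([dot(pv_emb, c) / scale for c in click_embs])
    recon = [
        sum(w * c[d] for w, c in zip(weights, click_embs))
        for d in range(len(pv_emb))
    ]
    return recon, weights


def contrastive_loss(pv_embs, click_embs, clicked_index, tau=0.1):
    """Simplified InfoNCE-style loss (an assumption, not the paper's exact
    objective): the clicked PV item should have the highest similarity
    between its embedding and its reconstruction."""
    logits = []
    for e_pv in pv_embs:
        r_pv, _ = reconstruct(e_pv, click_embs)
        logits.append(cosine(e_pv, r_pv) / tau)
    probs = softmax(logits)
    return -math.log(probs[clicked_index])


# Toy data: the click history resembles PV item 0 far more than PV item 1.
clicks = [[1.0, 0.0], [0.9, 0.1]]
pv_items = [[1.0, 0.0], [0.0, 1.0]]

loss_if_item0_clicked = contrastive_loss(pv_items, clicks, clicked_index=0)
loss_if_item1_clicked = contrastive_loss(pv_items, clicks, clicked_index=1)
print(loss_if_item0_clicked < loss_if_item1_clicked)
```

In this toy setup the loss is lower when the clicked item is the one that the click history can reconstruct well, which is the signal a trainable image backbone would follow. A smaller $\tau$ sharpens the softmax over PV items, which is the hyperparameter whose effect Figure 3 reports.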