Learning ID-free Item Representation with Token Crossing for Multimodal Recommendation

Kangning Zhang, Jiarui Jin, Yingjie Qin, Ruilong Su, Jianghao Lin, Yong Yu, Weinan Zhang

TL;DR

MOTOR, an ID-free MultimOdal TOken Representation scheme, represents each item with learnable multimodal tokens and connects items through shared tokens. This reduces the overall space requirements of the model, facilitates information interaction among related items, and significantly enhances recommendation capability.

Abstract

Current multimodal recommendation models have extensively explored the effective utilization of multimodal information; however, their reliance on ID embeddings remains a performance bottleneck. Even with the assistance of multimodal information, optimizing ID embeddings remains challenging for ID-based multimodal recommenders when interaction data is sparse. Furthermore, because each ID embedding is unique to a single item, it hinders information exchange among related items, and the space required by ID embeddings grows with the number of items. To address these limitations, we propose an ID-free MultimOdal TOken Representation scheme named MOTOR that represents each item using learnable multimodal tokens and connects items through shared tokens. Specifically, we first employ product quantization to discretize each item's multimodal features (e.g., images, text) into discrete token IDs. We then interpret the token embeddings corresponding to these token IDs as implicit item features and introduce a new Token Cross Network to capture the implicit interaction patterns among these tokens. The resulting representations can replace the original ID embeddings and transform an ID-based multimodal recommender into an ID-free system, without introducing any additional loss design. MOTOR reduces the overall space requirements of these models and facilitates information interaction among related items, while also significantly enhancing the model's recommendation capability. Extensive experiments on nine mainstream models demonstrate the significant performance improvement achieved by MOTOR, highlighting its effectiveness in enhancing multimodal recommendation systems.
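
The discretization step described in the abstract can be illustrated with a minimal NumPy sketch of product quantization. This is not the authors' implementation: the helper names (`kmeans`, `product_quantize`), the feature dimension, the number of sub-spaces, the codebook size, and the random features are all assumptions chosen for illustration, and the token embedding table is a random placeholder for what would be a learnable parameter. The Token Cross Network is not sketched here because its details are not given in this section.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid: (n, k)
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)

def product_quantize(features, num_subspaces, codebook_size, seed=0):
    """Split each feature vector into sub-vectors and cluster each sub-space.

    Returns an (n_items, num_subspaces) array of discrete token IDs.
    """
    n, dim = features.shape
    assert dim % num_subspaces == 0, "dim must be divisible by num_subspaces"
    sub = features.reshape(n, num_subspaces, dim // num_subspaces)
    token_ids = np.empty((n, num_subspaces), dtype=np.int64)
    for m in range(num_subspaces):
        _, token_ids[:, m] = kmeans(sub[:, m, :], codebook_size, seed=seed + m)
    return token_ids

# Hypothetical setup: 100 items with 16-dimensional image features.
rng = np.random.default_rng(0)
image_features = rng.normal(size=(100, 16)).astype(np.float32)
token_ids = product_quantize(image_features, num_subspaces=4, codebook_size=8)

# Placeholder token embedding table: one codebook of embeddings per sub-space.
# In a trained model these would be learnable parameters, not random values.
embed_dim = 8
token_table = rng.normal(scale=0.1, size=(4, 8, embed_dim))
item_token_embs = token_table[np.arange(4), token_ids]  # (100, 4, embed_dim)
```

Under these assumptions the table holds 4 × 8 token embeddings regardless of how many items exist, whereas an ID embedding table needs one row per item; items that fall into the same cluster in a sub-space share that token embedding.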

Paper Structure

This paper contains 30 sections, 11 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: MOTOR transforms the original ID-based recommendation model into an ID-free framework by removing the ID Embedding Table and maintaining modal-specific Token Embedding Tables with fewer parameters. Consider a scenario where items 0 and 1 are popular items, while item 2 is a cold-start item. MOTOR establishes connections among these items through shared tokens in either the image or text modality. Specifically, items 0, 1, and 2 are linked by a common image token 0, and items 1 and 2 are connected through a shared text token 3. Consequently, the cold-start item 2 can enhance its representation through associations with other related items.
  • Figure 2: A high-level illustration of MOTOR. The dashed line over the Item ID Embedding indicates that MOTOR replaces the original ID embeddings with the learned Item Token Representations. The core components of MOTOR include Feature Discretization (Section \ref{sec:Token Learning}), Token Embeddings (Section \ref{sec:token embeddings}), and Token Cross Network (Section \ref{sec:Token Cross Network}). In the diagram, the Token Cross Network on the right is modal-agnostic: it performs a holistic cross-fusion of tokens from all modalities.
  • Figure 3: The Recall@20 (a) and NDCG@20 (b) of four models for items with diverse interaction counts. The shaded areas represent the performance improvement of MOTOR-enhanced (ID-free) models compared to the original (ID-based) models.
  • Figure 4: The number of trainable parameters (left, bars) and Recall@20 performance (right, polyline) under different numbers of tokens.
  • Figure 5: We randomly select an item from a specific dataset as the query and retrieve the two items whose tokens are most similar to its own. Discrete tokens preserve crucial semantic information inherent in the original multimodal features.
  • ...and 1 more figures
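
The shared-token linkage described in the Figure 1 caption can be traced with a small Python sketch. Only image token 0 (shared by items 0, 1, and 2) and text token 3 (shared by items 1 and 2) come from the caption; the text token assigned to item 0, the data layout, and the helper `linked_items` are hypothetical and exist only to make the example complete.

```python
from collections import defaultdict

# Tokens per item as (modality, token_id) pairs.
# ("text", 1) for item 0 is a made-up value; the caption does not specify it.
item_tokens = {
    0: [("image", 0), ("text", 1)],
    1: [("image", 0), ("text", 3)],
    2: [("image", 0), ("text", 3)],  # the cold-start item in the caption
}

# Inverted index: which items use each token.
token_to_items = defaultdict(set)
for item, tokens in item_tokens.items():
    for tok in tokens:
        token_to_items[tok].add(item)

def linked_items(item):
    """Map every other item to the tokens it shares with `item`."""
    links = defaultdict(list)
    for tok in item_tokens[item]:
        for other in token_to_items[tok]:
            if other != item:
                links[other].append(tok)
    return dict(links)

cold_start_links = linked_items(2)
```

In this toy setting the cold-start item 2 is linked to item 0 through the shared image token and to item 1 through both the image and the text token, so updates to those token embeddings driven by interactions with the popular items also affect item 2's representation.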