General Item Representation Learning for Cold-start Content Recommendations

Jooeun Kim, Jinri Kim, Kwangeun Yeo, Eungi Kim, Kyoung-Woon On, Jonghwan Mun, Joonseok Lee

TL;DR

The paper tackles cold-start item recommendation by leveraging rich multimodal content signals rather than relying on user-item interactions alone. It introduces a domain/dataset-agnostic item content representation framework built on Transformer-based modality-specific encoders with flexible fusion strategies, trained end-to-end on user activity data only. Two training objectives are proposed: a rating ranking loss and an optional multimodal alignment loss to harmonize content modalities. Empirical results on movie and news benchmarks demonstrate state-of-the-art cold-start performance and good transferability across domains, while reducing dependence on large labeled classification data. Overall, the approach yields fine-grained item representations that better capture user tastes and support scalable deployment.

Abstract

Cold-start item recommendation is a long-standing challenge in recommendation systems. A common remedy is to use a content-based approach, but rich information from raw contents in various forms has not been fully utilized. In this paper, we propose a domain/data-agnostic item representation learning framework for cold-start recommendations, naturally equipped with multimodal alignment among various features by adopting a Transformer-based architecture. Our proposed model is end-to-end trainable and completely free from classification labels, which are not only costly to collect but also suboptimal for recommendation-purpose representation learning. From extensive experiments on real-world movie and news recommendation benchmarks, we verify that our approach better preserves fine-grained user taste than state-of-the-art baselines and is universally applicable to multiple domains at large scale.

Paper Structure (32 sections, 5 equations, 4 figures, 12 tables)

This paper contains 32 sections, 5 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Overall Architecture. $C$ content features are extracted for each item using modality-specific encoders. (A few examples are illustrated in Fig. 2.) Then, the Feature Fusion Layer aggregates them into the final item representation $\mathbf{v}_j$, and the rating $\mathbf{R}_{ij}$ is predicted by taking the dot product with the target user embedding $\mathbf{u}_i$, which is learned in the manner of collaborative filtering.
  • Figure 2: Examples of Modality-specific Encoders. From the left, we illustrate the image, video, and text encoders.
  • Figure 3: t-SNE Visualization of Learned Video Embeddings
  • Figure 4: Illustration of item embeddings used in the pairwise similarity analysis in Table \ref{tab:example_users2}.
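
The data flow described in the Figure 1 caption (per-modality features, a fusion step producing $\mathbf{v}_j$, and a dot product with $\mathbf{u}_i$) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the random vectors stand in for the outputs of the modality-specific encoders, the weighted-mean fusion is an assumed placeholder for the paper's Feature Fusion Layer, and the names `fuse_features` and `predict_rating` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8  # assumed embedding size, chosen only for illustration


def fuse_features(features, weights=None):
    """Aggregate C per-modality feature vectors into one item vector v_j.

    Assumption: a (weighted) mean stands in for the Feature Fusion Layer,
    whose actual form is not given in this summary.
    """
    stacked = np.stack(features)          # shape (C, EMBED_DIM)
    if weights is None:
        return stacked.mean(axis=0)       # plain average over modalities
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize weights to sum to 1
    return w @ stacked                    # weighted average, shape (EMBED_DIM,)


def predict_rating(user_emb, item_emb):
    """Predicted rating R_ij as the dot product of u_i and v_j."""
    return float(user_emb @ item_emb)


# Stand-ins for encoder outputs (image, video, text); real encoders would
# map raw content to these vectors.
image_feat = rng.normal(size=EMBED_DIM)
video_feat = rng.normal(size=EMBED_DIM)
text_feat = rng.normal(size=EMBED_DIM)

v_j = fuse_features([image_feat, video_feat, text_feat])  # item representation
u_i = rng.normal(size=EMBED_DIM)                          # user embedding
r_ij = predict_rating(u_i, v_j)                           # predicted rating
```

Because the item vector is computed from content alone, a new item with no interaction history can still be scored against every existing user embedding, which is the property that makes this layout suitable for the cold-start setting.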