Solving cold start in news recommendations: a RippleNet-based system for large scale media outlet

Karol Radziszewski, Michał Szpunar, Piotr Ociepka, Mateusz Buczyński

TL;DR

This work targets the persistent cold-start problem in news recommendations by augmenting RippleNet with semantic embeddings from large language models to better represent newly published items. It contributes a production-oriented pipeline deployed on SageMaker with Airflow-driven data flows and a richly described golden dataset in the Polish news domain. Offline and online evaluations show that while the RippleNet+LLM hybrid captures semantic relationships, it does not yet outperform a production baseline in real-world deployment, and online results reveal negative engagement effects. The study demonstrates the potential of knowledge-graph–driven approaches for rapidly changing content while outlining clear directions for improving generalization and production readiness.

Abstract

We present a scalable recommender system implementation based on RippleNet, tailored to the media domain and deployed in production at Onet.pl, one of Poland's largest online media platforms. Our solution addresses the cold-start problem for newly published content by integrating content-based item embeddings into RippleNet's knowledge propagation mechanism, which makes it possible to score previously unseen items. The system architecture uses Amazon SageMaker for distributed training and inference, and Apache Airflow for orchestrating data pipelines and model retraining workflows. To ensure high-quality training data, we constructed a comprehensive golden dataset consisting of user features, item features, and a separate interaction table, a layout that allows flexible extension and integration of new signals.
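As a rough illustration of how a content embedding can stand in for a missing trained embedding, the sketch below matches a newly published item to its nearest known item by cosine similarity and scores it with a sigmoid over a user-item inner product, as RippleNet does for its click probability. The matching rule, the function names, and the toy vectors are assumptions made for this example and are not taken from the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_cold_item(new_content_vec, content_vecs):
    """Return the id of the known item whose content embedding is closest.

    Hypothetical matching rule: nearest neighbour by cosine similarity.
    """
    return max(content_vecs, key=lambda item_id: cosine(new_content_vec, content_vecs[item_id]))


def score(user_vec, item_vec):
    """Click probability as a sigmoid of the user-item inner product."""
    z = sum(u * v for u, v in zip(user_vec, item_vec))
    return 1.0 / (1.0 + math.exp(-z))


# Toy data (made up): content embeddings and trained embeddings of known items.
content_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
trained_vecs = {"a": [0.5, -0.2], "b": [-0.3, 0.8]}

# A newly published item has a content embedding but no trained embedding yet.
new_item_content = [0.9, 0.1]
matched = match_cold_item(new_item_content, content_vecs)

# Borrow the matched item's trained embedding to score the new item for a user.
user_vec = [1.0, 1.0]
cold_score = score(user_vec, trained_vecs[matched])
```

In this toy setup the new item is matched to item `a`, whose trained embedding is then reused for scoring. A production system would more likely use an approximate nearest-neighbour index or a learned mapping in place of the exhaustive search shown here.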

Paper Structure

This paper contains 16 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: RippleNet's ripple sets in a knowledge graph built from an example dataset
  • Figure 2: Histogram of cosine similarity between matched and real embeddings
  • Figure 3: The automated training and deployment pipeline for the RippleNet model