LARR: Large Language Model Aided Real-time Scene Recommendation with Semantic Understanding

Zhizhong Wan, Bin Yin, Junjie Xie, Fei Jiang, Xiang Li, Wei Lin

TL;DR

LARR tackles the gap between semantic scene understanding and collaborative signals in real-time food-delivery recommendations. It proposes a three-stage pipeline: domain-specific continual pretraining of an LLM, contrastive fine-tuning to convert the LLM into a text embedding model using three sample strategies, and a multimodal alignment stage that aggregates LLM outputs from multiple real-time scene features with a bidirectional encoder. By freezing the LLM during alignment and using an aggregation token, LARR achieves efficient real-time inference while leveraging rich semantic representations, demonstrated by strong offline gains and positive online impact on the Meituan Waimai dataset. The approach offers practical guidance for deploying LLM-enhanced CTR models in industry, balancing semantic depth with serving latency and scalability.

Abstract

Click-Through Rate (CTR) prediction is crucial for Recommendation Systems (RS), which aim to provide personalized recommendation services in domains such as food delivery and e-commerce. However, traditional RS rely on collaborative signals and lack semantic understanding of real-time scenes. We also observe that a major challenge in applying Large Language Models (LLMs) to practical recommendation is their inefficiency on long text input. To address these problems, we propose Large Language Model Aided Real-time Scene Recommendation (LARR), which adopts LLMs for semantic understanding and utilizes real-time scene information in RS without requiring the LLM to process the entire real-time scene text directly, thereby improving the efficiency of LLM-based CTR modeling. Specifically, recommendation domain-specific knowledge is injected into the LLM, and the RS then employs an aggregation encoder to build real-time scene information from the LLM's separate outputs. First, an LLM is continually pretrained on a corpus built from recommendation data with the aid of special tokens. Subsequently, the LLM is fine-tuned via contrastive learning with three kinds of sample construction strategies, which transforms it into a text embedding model. Finally, the LLM's separate outputs for different scene features are aggregated by an encoder and aligned with the collaborative signals in RS, enhancing the performance of the recommendation model.
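
The contrastive fine-tuning step can be illustrated with a generic in-batch InfoNCE loss. The abstract does not state the exact objective, so this is only a sketch under assumptions: the function names, the cosine similarity, the temperature value, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss.

    positives[i] is the positive for anchors[i]; every other entry of
    `positives` serves as an in-batch negative for that anchor.
    The temperature is an illustrative value, not one from the paper.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (hypothetical): each anchor paired with a nearby positive ...
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
# ... versus the same positives assigned to the wrong anchors.
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

With correctly paired positives the loss is close to zero, while swapping the pairs makes it much larger. That difference is the pressure that pulls embeddings of related texts together during fine-tuning.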

Paper Structure

This paper contains 16 sections, 26 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: How the LLM understands real-time scenes and helps the recommendation system.
  • Figure 2: Model overview, comprising three stages. In stage 1, the LLM undergoes continual pretraining on a shop-related corpus. In stage 2, the LLM is fine-tuned via contrastive learning on three types of positive samples, which transforms it into a text embedding model. In stage 3, the LLM's semantic embeddings are aligned with the RS's collaborative embeddings to improve the recommendation results.
  • Figure 3: Positive pair construction. Red curves represent the alignment procedure.
  • Figure 4: Hyperparameter Analysis
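
Stage 3 in Figure 2 aggregates the LLM's separate per-feature outputs with a bidirectional encoder and an aggregation token. The sketch below shows the general idea only, and every detail is an assumption: the feature names, the embedding values, the single attention layer, and the identity query/key/value projections are hypothetical simplifications, not the paper's architecture. It also assumes the per-feature embeddings come from the frozen LLM and are computed ahead of the aggregation step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def self_attention(tokens):
    """Unmasked (bidirectional) single-head self-attention.

    Identity query/key/value projections are used for brevity;
    a real encoder would learn these projections.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens))
             for d in range(dim)]
        )
    return outputs


def aggregate_scene(feature_embeddings, agg_token):
    """Prepend the aggregation token and return its output vector."""
    sequence = [agg_token] + feature_embeddings
    return self_attention(sequence)[0]


# Hypothetical embeddings, one per real-time scene feature,
# standing in for outputs of the frozen LLM.
scene_features = {
    "time_of_day": [0.2, 0.8, 0.1],
    "weather": [0.7, 0.1, 0.3],
    "location": [0.4, 0.4, 0.9],
}
agg_token = [0.5, 0.5, 0.5]  # would be a learned parameter in practice

scene_vector = aggregate_scene(list(scene_features.values()), agg_token)
```

Because the LLM only ever encodes each short feature text on its own, the per-request cost is limited to the small aggregation encoder. The resulting `scene_vector` is what would be aligned with the collaborative embeddings of the recommendation model.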