HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling

Junyi Chen, Lu Chi, Bingyue Peng, Zehuan Yuan

TL;DR

HLLM is a hierarchical two-LLM architecture for sequential recommendation: an Item LLM converts rich item text into compact embeddings, and a User LLM models user interests from those embeddings. The experiments show that pre-trained LLM weights are valuable for recommendation, that task-specific fine-tuning remains essential, and that the approach scales to multi-billion-parameter configurations (up to 7B parameters). On PixelRec and Amazon Book Reviews, HLLM achieves state-of-the-art offline performance, and online A/B tests confirm its practical value; caching item embeddings improves training and serving efficiency. These results make HLLM well suited to industrial deployment, where accuracy, scalability, and efficiency must be balanced.

Abstract

Large Language Models (LLMs) have achieved remarkable success in various fields, prompting several studies to explore their potential in recommendation systems. However, these attempts have so far resulted in only modest improvements over traditional recommendation models. Moreover, three critical questions remain under-explored: first, the real value of LLMs' pre-trained weights, often considered to encapsulate world knowledge; second, the necessity of fine-tuning for recommendation tasks; third, whether LLMs can exhibit the same scalability benefits in recommendation systems as they do in other domains. In this paper, we propose a novel Hierarchical Large Language Model (HLLM) architecture designed to enhance sequential recommendation systems. Our approach employs a two-tier model: the first Item LLM extracts rich content features from the detailed text description of the item, while the second User LLM utilizes these features to predict users' future interests based on their interaction history. Extensive experiments demonstrate that our method effectively leverages the pre-trained capabilities of open-source LLMs, and further fine-tuning leads to significant performance boosts. Additionally, HLLM achieves excellent scalability, with the largest configuration utilizing 7B parameters for both item feature extraction and user interest modeling. Moreover, HLLM offers excellent training and serving efficiency, making it practical in real-world applications. Evaluations on two large-scale datasets, PixelRec and Amazon Reviews, show that HLLM achieves state-of-the-art results, outperforming traditional ID-based models by a wide margin. In online A/B testing, HLLM showcases notable gains, validating its practical impact in real-world recommendation scenarios. Code is available at https://github.com/bytedance/HLLM.
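
The two-tier data flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `toy_item_encoder` and `toy_user_encoder` are hypothetical stand-ins (a hash and a recency-weighted mean) used only so the example runs without model weights, and `ItemEmbeddingCache`, `rank_candidates`, and `DIM` are names invented for illustration. What the sketch does show is the shape of the pipeline: item text becomes an embedding once, the embedding is cached, and the user model works only on embeddings.

```python
import hashlib
import math

DIM = 8  # toy embedding width (assumption); real LLM hidden sizes are far larger


def toy_item_encoder(text):
    """Hypothetical stand-in for the Item LLM: item text plus [ITEM] -> unit vector."""
    digest = hashlib.sha256((text + " [ITEM]").encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:DIM]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


class ItemEmbeddingCache:
    """Encodes each item once and reuses the stored embedding afterwards."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.store = {}
        self.encoder_calls = 0

    def get(self, item_id, text):
        if item_id not in self.store:
            self.store[item_id] = self.encoder(text)
            self.encoder_calls += 1
        return self.store[item_id]


def toy_user_encoder(history):
    """Hypothetical stand-in for the User LLM: recency-weighted mean of item embeddings."""
    weights = range(1, len(history) + 1)  # later interactions weigh more
    total = sum(weights)
    return [sum(w * emb[d] for w, emb in zip(weights, history)) / total
            for d in range(DIM)]


def rank_candidates(user_vec, candidates):
    """Sort candidate item ids by dot-product score against the user vector."""
    def score(item_id):
        return sum(u * c for u, c in zip(user_vec, candidates[item_id]))
    return sorted(candidates, key=score, reverse=True)


# Made-up catalog and interaction history, for illustration only.
catalog = {"a": "Red running shoes", "b": "Trail running shoes", "c": "Cast iron skillet"}
cache = ItemEmbeddingCache(toy_item_encoder)
history = [cache.get(i, catalog[i]) for i in ["a", "b", "a"]]
user_vec = toy_user_encoder(history)
ranked = rank_candidates(user_vec, {i: cache.get(i, t) for i, t in catalog.items()})
print(ranked, cache.encoder_calls)
```

Because the stand-in encoder is a hash, the resulting ranking carries no meaning; the point is that the encoder runs once per distinct item (three calls here, despite item "a" appearing several times), which is the kind of reuse that embedding caching relies on.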

Paper Structure (30 sections, 3 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: Architecture of Hierarchical Large Language Model. HLLM consists of two LLMs with non-shared parameters: Item LLM and User LLM. The Item LLM takes the text description of an item as input, appended with a special token [ITEM], and outputs the item embedding. The User LLM takes the item embeddings of the user's historical interactions as input and predicts the next item. All LLM parameters are trainable and optimized via next-item prediction.
  • Figure 2: Two User LLM variants for discriminative recommendations.
  • Figure 3: Experiments of HLLM's performance at various data scales. Recall@5 and NDCG@5 are reported.
  • Figure 4: An overview of the online system.
  • Figure 5: Distribution of textual descriptions (flattening all attributes) and sequence lengths in Pixel200K, Pixel1M, Pixel8M, Amazon Book Reviews and industrial scenario. Since Pixel200K is randomly sampled from Pixel1M, their distributions are consistent. We truncate the sequence length to a maximum of 2,000 for industrial data, hence p90 is exactly 2,000.
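
Figure 3 reports Recall@5 and NDCG@5. As a reminder of what these metrics measure in next-item prediction, here is a minimal sketch that assumes exactly one held-out target item per user, in which case the ideal DCG is 1 and NDCG@K reduces to 1 / log2(rank + 1) when the target is ranked within the top K. The function names and the tie-handling rule are assumptions made for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
import math


def rank_of_target(scores, target):
    """1-based rank of the target when items are sorted by descending score.

    Ties are resolved in the target's favor (an assumption of this sketch).
    """
    target_score = scores[target]
    return 1 + sum(1 for item, s in scores.items()
                   if item != target and s > target_score)


def recall_at_k(rank, k):
    """1.0 if the single target item appears in the top k, else 0.0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k):
    """With one relevant item, IDCG is 1, so NDCG@k is 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


def mean_metrics(ranks, k):
    """Average Recall@k and NDCG@k over a list of per-user target ranks."""
    n = len(ranks)
    recall = sum(recall_at_k(r, k) for r in ranks) / n
    ndcg = sum(ndcg_at_k(r, k) for r in ranks) / n
    return recall, ndcg


# Made-up ranks for three users: the target is ranked 1st, 3rd, and 7th.
recall5, ndcg5 = mean_metrics([1, 3, 7], k=5)
print(recall5, ndcg5)
```

In this made-up example two of the three targets fall inside the top 5, and the user whose target is ranked 3rd contributes 1 / log2(4) = 0.5 to NDCG@5, which shows how NDCG rewards placing the target higher while Recall only checks whether it is present.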