Efficient and Responsible Adaptation of Large Language Models for Robust Top-k Recommendations

Kirandeep Kaur, Chirag Shah

TL;DR

This work tackles robustness gaps in recommender systems by marrying traditional RSs with large language models through a responsible, per-user task allocation strategy. It first identifies weak/inactive users using a dual criterion based on a sparsity index $S_I(u)$ and ranking performance $P(u)$, then employs in-context learning to prompt LLMs to rank weak users’ histories, while strong users rely on RS rankings. Across three real-world datasets and multiple baselines, the approach yields significant improvements in weak-user performance and overall robustness (about 12%), with substantial reductions in the number of weak users and manageable cost increases depending on the LLM used. The method demonstrates that open-source LLMs can achieve competitive results alongside closed models when deployed selectively, thereby enabling practical, responsible deployment of generative models in recommendation systems.

Abstract

Conventional recommendation systems (RSs) are typically optimized to enhance performance metrics uniformly across all training samples. This makes it hard for data-driven RSs to cater to a diverse set of users due to the varying properties of these users. The performance disparity among various populations can harm the model's robustness with respect to sub-populations. While recent works have shown promising results in adapting large language models (LLMs) for recommendation to address hard samples, long user queries from millions of users can degrade the performance of LLMs and elevate costs, processing times and inference latency. This challenges the practical applicability of LLMs for recommendations. To address this, we propose a hybrid task allocation framework that utilizes the capabilities of both LLMs and traditional RSs. By adopting a two-phase approach to improve robustness to sub-populations, we promote a strategic assignment of tasks for efficient and responsible adaptation of LLMs. Our strategy works by first identifying the weak and inactive users that receive a suboptimal ranking performance from RSs. Next, we use an in-context learning approach for such users, wherein each user's interaction history is contextualized as a distinct ranking task and given to an LLM. We test our hybrid framework by incorporating various recommendation algorithms (collaborative filtering and learning-to-rank recommendation models) and two LLMs (one open-source and one closed-source). Our results on three real-world datasets show a significant reduction in weak users, improved robustness of RSs to sub-populations ($\approx 12\%$), and improved overall performance without disproportionately escalating costs.
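
The first phase described above, flagging users who are both highly sparse and poorly served by the RS, can be illustrated with a minimal sketch. The exact definition of the sparsity index is not given in this summary, so the form used here (one minus the fraction of the catalogue a user has interacted with) is an assumption, as are the names `UserStats`, `sparsity_index`, `find_weak_users` and both threshold parameters.

```python
from dataclasses import dataclass


@dataclass
class UserStats:
    user_id: str
    n_interactions: int   # number of items the user has interacted with
    performance: float    # P(u): per-user ranking metric from the base RS, e.g. AUC


def sparsity_index(n_interactions: int, n_items: int) -> float:
    """ASSUMED form of S_I(u): share of the catalogue the user has not touched."""
    return 1.0 - n_interactions / n_items


def find_weak_users(users, n_items, sparsity_threshold, performance_threshold):
    """Return ids of users that satisfy BOTH criteria: high sparsity and low P(u).

    The thresholds are hypothetical parameters, not values from the paper.
    """
    weak = []
    for u in users:
        s = sparsity_index(u.n_interactions, n_items)
        if s > sparsity_threshold and u.performance < performance_threshold:
            weak.append(u.user_id)
    return weak


if __name__ == "__main__":
    users = [
        UserStats("a", n_interactions=5, performance=0.40),    # sparse and poorly served
        UserStats("b", n_interactions=500, performance=0.40),  # poorly served but active
        UserStats("c", n_interactions=5, performance=0.90),    # sparse but well served
    ]
    print(find_weak_users(users, n_items=1000,
                          sparsity_threshold=0.9, performance_threshold=0.5))
```

Requiring both conditions keeps the set of users sent to the LLM small, which is what holds the additional inference cost down.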

Paper Structure (17 sections, 6 equations, 4 figures, 2 tables, 1 algorithm)

Figures (4)

  • Figure 1: An overview of our framework, which uses task allocation to adapt LLMs responsibly. We compute each user's sparsity index ($S_I$), evaluate the recommendations retrieved from the RS using a performance metric ($P(u_m)$), and plot $P(u_m)$ against $S_I$. Interaction histories of highly sparse users with low $P(u_m)$ are contextualized and given to an LLM for ranking. Strong users receive RS recommendations, while weak users receive LLM recommendations if the LLM outperforms the RS.
  • Figure 2: Instruction template for contextualizing interaction histories of weak users.
  • Figure 3: AUC vs. sparsity scatter plots illustrating the performance (measured using AUC, shown on the x-axis) for all users in the ML1M, ML100k and Book-Crossing (B-C) datasets on three different algorithms.
  • Figure 4: Comparative analysis of the reduction in the count of weak users.
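
The allocation rule in the Figure 1 caption (strong users keep the RS ranking; weak users switch to the LLM ranking only if the LLM outperforms the RS) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `allocate_recommendations` and the dictionary-based inputs are hypothetical, and the per-user scores are assumed to come from the same metric $P(u_m)$ for both systems.

```python
def allocate_recommendations(rs_rankings, llm_rankings, rs_scores, llm_scores, weak_users):
    """Pick, per user, which system's ranking to serve.

    rs_rankings / llm_rankings: dict user_id -> ranked list of item ids
    rs_scores / llm_scores:     dict user_id -> per-user metric value (assumed comparable)
    weak_users:                 set of user ids flagged as weak
    All argument names are hypothetical and used only for illustration.
    """
    final = {}
    for user, rs_ranking in rs_rankings.items():
        use_llm = (
            user in weak_users
            and user in llm_rankings
            # Fall back to the RS when no LLM score is available for the user.
            and llm_scores.get(user, float("-inf")) > rs_scores[user]
        )
        final[user] = llm_rankings[user] if use_llm else rs_ranking
    return final


if __name__ == "__main__":
    rs_rankings = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}
    llm_rankings = {"a": [3, 2, 1], "b": [6, 5, 4]}   # only weak users were sent to the LLM
    rs_scores = {"a": 0.40, "b": 0.45, "c": 0.90}
    llm_scores = {"a": 0.60, "b": 0.30}
    print(allocate_recommendations(rs_rankings, llm_rankings,
                                   rs_scores, llm_scores, weak_users={"a", "b"}))
```

In this toy input, user "a" is served the LLM ranking, user "b" stays with the RS because the LLM scored lower, and user "c" is never sent to the LLM at all.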

Theorems & Definitions (1)

  • Definition 1