Retrieval Augmented Generation with Collaborative Filtering for Personalized Text Generation

Teng Shi, Jun Xu, Xiao Zhang, Xiaoxue Zang, Kai Zheng, Yang Song, Han Li

TL;DR

This work addresses personalized LLM generation by introducing CFRAG, which extends Retrieval-Augmented Generation with collaborative filtering. It learns user embeddings via contrastive learning to identify the top-$m$ similar users, then retrieves the top-$k$ documents from those users' histories using a personalized retriever and reranker. The retriever balances semantic relevance with user preference, and the reranker further refines the ranking; both are fine-tuned with LLM feedback through KL-divergence-based training so that the retrieved documents improve generation quality. Experiments on the LaMP benchmark show that CFRAG consistently outperforms baselines, and ablations confirm the importance of collaborative information and LLM-driven fine-tuning, indicating practical benefits for scalable, privacy-conscious personalized text generation.
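
To make the KL-divergence-based training concrete, here is a minimal sketch of one plausible form of the objective. It is not taken from the paper: the function names, the softmax temperature, the direction of the divergence (target first), and the idea that the LLM feedback arrives as one scalar score per candidate document are all assumptions made for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def llm_feedback_loss(retriever_scores, llm_scores, temperature=1.0):
    """Hypothetical loss: pull the retriever's ranking toward the LLM's preference.

    llm_scores[i] is assumed to measure how good the LLM output was when
    candidate document i was placed in the prompt.
    """
    target = softmax(llm_scores, temperature)           # what the LLM prefers
    predicted = softmax(retriever_scores, temperature)  # what the retriever prefers
    return kl_divergence(target, predicted)

# A retriever that ranks candidates like the LLM does incurs no loss;
# one that ranks them in the opposite order is penalized.
aligned = llm_feedback_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
misaligned = llm_feedback_loss([0.1, 1.0, 2.0], [2.0, 1.0, 0.1])
```

In a real training loop the same quantity would be computed on differentiable tensors and minimized with respect to the retriever (or reranker) parameters; the pure-Python version above only illustrates the shape of the objective.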

Abstract

Recently, the personalization of Large Language Models (LLMs) to generate content that aligns with individual user preferences has garnered widespread attention. Personalized Retrieval-Augmented Generation (RAG), which retrieves relevant documents from the user's history to reflect their preferences and enhance LLM generation, is one commonly used approach for personalization. However, existing personalized RAG methods do not consider that the histories of similar users can also assist in personalized generation for the current user, meaning that collaborative information between users can also benefit personalized generation. Inspired by the application of collaborative filtering in recommender systems, we propose a method called CFRAG, which adapts Collaborative Filtering to RAG for personalized text generation. However, this presents two challenges: (1) how to incorporate collaborative information without explicit user similarity labels? (2) how to retrieve documents that support personalized LLM generation? For Challenge 1, we use contrastive learning to train user embeddings to retrieve similar users and introduce collaborative information. For Challenge 2, we design a personalized retriever and reranker to retrieve the top-$k$ documents from these users' histories. We take into account the user's preference during retrieval and reranking. Then we leverage feedback from the LLM to fine-tune the personalized retriever and reranker, enabling them to retrieve documents that meet the personalized generation needs of the LLM. Experimental results on the Language Model Personalization (LaMP) benchmark validate the effectiveness of CFRAG. Further analysis confirms the importance of incorporating collaborative information.
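
The three stages described in the abstract (find similar users, retrieve the top-$k$ documents from each of their histories, rerank the pooled candidates) can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the embeddings are hand-written vectors, the scoring function is a hypothetical blend of query-document and user-document cosine similarity with weight `alpha`, the same scorer stands in for both retriever and reranker, and only the neighbours' histories are searched.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_similar_users(target_id, user_embs, m):
    """Return the m users whose embeddings are closest to the target user's."""
    target = user_embs[target_id]
    scored = [(cosine(target, emb), uid)
              for uid, emb in user_embs.items() if uid != target_id]
    scored.sort(reverse=True)
    return [uid for _, uid in scored[:m]]

def score(query_vec, user_vec, doc_vec, alpha=0.5):
    """Hypothetical scorer: semantic relevance blended with user preference."""
    return alpha * cosine(query_vec, doc_vec) + (1 - alpha) * cosine(user_vec, doc_vec)

def retrieve(query_vec, user_vec, docs, k):
    """Keep the k highest-scoring documents."""
    ranked = sorted(docs, key=lambda d: score(query_vec, user_vec, d["vec"]),
                    reverse=True)
    return ranked[:k]

def cfrag_select(target_id, query_vec, user_embs, histories, m, k):
    """User retrieval -> per-user top-k -> rerank the m*k pool to a final top-k."""
    user_vec = user_embs[target_id]
    candidates = []
    for uid in top_similar_users(target_id, user_embs, m):
        candidates.extend(retrieve(query_vec, user_vec, histories[uid], k))
    # Stand-in reranker: reuses the same scorer over the pooled candidates.
    return retrieve(query_vec, user_vec, candidates, k)

# Toy data (made up for illustration).
user_embs = {"u0": [1.0, 0.0], "u1": [0.9, 0.1], "u2": [0.0, 1.0], "u3": [0.8, 0.3]}
histories = {
    "u0": [],
    "u1": [{"text": "a", "vec": [1.0, 0.0]}, {"text": "b", "vec": [0.0, 1.0]}],
    "u2": [{"text": "c", "vec": [0.1, 1.0]}],
    "u3": [{"text": "d", "vec": [0.9, 0.2]}, {"text": "e", "vec": [0.2, 0.9]}],
}
selected = cfrag_select("u0", [1.0, 0.0], user_embs, histories, m=2, k=1)
selected_texts = [d["text"] for d in selected]
```

The selected documents would then be concatenated with the query and passed to the LLM. In the method itself the retriever and reranker are separate learned models; sharing one scorer here only keeps the sketch short.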

Paper Structure

This paper contains 35 sections, 19 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An example from the LaMP-4 dataset \cite{salemi2023lamp}. The task of LaMP-4 is to generate personalized news headlines based on user input. This example illustrates the benefit of collaborative information for LLM personalization: (a) The top shows results retrieved by the existing RAG method from the current user's history, where we can only infer that "She" in the user's input refers to "Hillary Clinton". (b) The bottom shows results retrieved by our method from similar users' histories, allowing us to further infer that "his" in the user's input refers to "Donald Trump", thus enabling the generation of a more accurate result.
  • Figure 2: The architecture of CFRAG. From left to right: (a) User Retrieval retrieves similar users (Section \ref{sec:user_retrieval}); (b) Retriever retrieves the top-$k$ documents from each user's history (Section \ref{sec:doc_retrieval}); (c) Reranker reranks the $m\times k$ documents to get the final top-$k$ documents, which are then concatenated with the query and input into the LLM for personalized text generation (Section \ref{sec:doc_rerank}).
  • Figure 3: Contrastive learning for user embedding training.
  • Figure 4: The method of training the retriever and reranker using LLM feedback.
  • Figure 5: Results of using different methods to select users for introducing collaborative information. "random" indicates randomly selecting $m$ users; "top-($m$-$2m$)" represents selecting users whose similarity to the current user ranks between $m$ and $2m$; "top-$m$" indicates selecting the most similar $m$ users.
  • ...and 4 more figures
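
The three user-selection strategies compared in Figure 5 can be written down compactly. The sketch below is only an illustration: `select_users` and its arguments are hypothetical names, the input is assumed to be a list of user ids already sorted by descending similarity to the current user, and "ranks between $m$ and $2m$" is read here as ranks $m+1$ through $2m$, which is one possible interpretation of the caption.

```python
import random

def select_users(ranked_users, m, strategy="top-m", seed=None):
    """Pick m users from a list sorted by descending similarity.

    strategy:
      "top-m"      -> the m most similar users
      "top-(m-2m)" -> the next m users (ranks m+1 .. 2m, an assumed reading)
      "random"     -> m users drawn uniformly without replacement
    """
    if strategy == "top-m":
        return ranked_users[:m]
    if strategy == "top-(m-2m)":
        return ranked_users[m:2 * m]
    if strategy == "random":
        return random.Random(seed).sample(ranked_users, m)
    raise ValueError(f"unknown strategy: {strategy!r}")

# Hypothetical ranking of eight users, most similar first.
ranked = ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"]
best = select_users(ranked, 3)                       # ['u1', 'u2', 'u3']
band = select_users(ranked, 3, "top-(m-2m)")         # ['u4', 'u5', 'u6']
rand = select_users(ranked, 3, "random", seed=0)     # three users, order varies with seed
```

Comparing generation quality across these three selections is what separates the benefit of similar users' histories from the benefit of simply having more documents to retrieve from.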