RLPF: Reinforcement Learning from Prediction Feedback for User Summarization with LLMs

Jiaxing Wu; Lin Ning; Luyang Liu; Harrison Lee; Neo Wu; Chao Wang; Sushant Prakash; Shawn O'Banion; Bradley Green; Jun Xie

TL;DR

RLPF addresses the challenge of leveraging long, noisy user histories for LLM-based personalization by learning concise natural-language user summaries that maximize downstream task utility. It casts summarization as a Contextual Markov Decision Process and optimizes a policy $π_θ$ through reinforcement learning using a prediction-based reward from a frozen LLM, with a KL regularization term to prevent reward hacking. The reward combines a future-activity prediction signal with a length penalty, enabling controlled context length while preserving predictive power. Across four public datasets, RLPF achieves substantial gains in downstream utility (up to 22% over baselines) and intrinsic summary quality, while reducing context length by up to 74%, demonstrating strong generalization to unseen tasks and domains and offering a privacy-conscious approach to personalized AI systems.
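The reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the exact-match prediction score, the linear form of the length penalty, and the names `prediction_reward`, `length_reward`, `rlpf_reward`, `target_words`, and `length_weight` are all assumptions made for the example, and `predictor` is a stand-in for the frozen LLM.

```python
from typing import Callable


def prediction_reward(predicted: str, actual: str) -> float:
    """Return 1.0 if the predicted future activity matches the true one, else 0.0.

    Exact (case-insensitive) matching is an assumption made for this sketch.
    """
    return 1.0 if predicted.strip().lower() == actual.strip().lower() else 0.0


def length_reward(summary: str, target_words: int) -> float:
    """Hypothetical length penalty: 0 within budget, down to -1 as the overshoot grows."""
    n_words = len(summary.split())
    if n_words <= target_words:
        return 0.0
    return -min(1.0, (n_words - target_words) / target_words)


def rlpf_reward(
    summary: str,
    actual_future: str,
    predictor: Callable[[str], str],
    target_words: int = 50,
    length_weight: float = 1.0,
) -> float:
    """Combine the prediction signal with the length penalty for one summary.

    `predictor` stands in for the frozen LLM: it maps a summary to a
    predicted future activity and is never updated during training.
    """
    predicted = predictor(summary)
    return prediction_reward(predicted, actual_future) + length_weight * length_reward(
        summary, target_words
    )


def stub_predictor(summary: str) -> str:
    """Toy predictor used only to exercise the reward function."""
    return "Inception"


short_summary = "Enjoys science fiction thrillers"
long_summary = " ".join(["word"] * 75)

short_score = rlpf_reward(short_summary, "Inception", stub_predictor)
long_score = rlpf_reward(long_summary, "Inception", stub_predictor)
```

In this sketch the KL regularization term is not part of the reward function; it would be applied separately in the RL objective to keep the policy $π_θ$ close to its initial model.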

Abstract

LLM-powered personalization agent systems employ Large Language Models (LLMs) to predict users' behavior from their past activities. However, their effectiveness often hinges on the ability to leverage extensive user history data, which is difficult because such data is inherently long and noisy. Existing pretrained LLMs may generate summaries that are concise but lack the necessary context for downstream tasks, hindering their utility in personalization systems. To address these challenges, we introduce Reinforcement Learning from Prediction Feedback (RLPF). RLPF fine-tunes LLMs to generate concise, human-readable user summaries that are optimized for downstream task performance. By maximizing the usefulness of the generated summaries, RLPF effectively distills extensive user history data while preserving essential information for downstream tasks. Our empirical evaluation demonstrates significant improvements in both extrinsic downstream task utility and intrinsic summary quality, surpassing baseline methods by up to 22% on downstream task performance and achieving up to an 84.59% win rate on Factuality, Abstractiveness, and Readability. RLPF also achieves a remarkable 74% reduction in context length while improving performance on 16 out of 19 unseen tasks and/or datasets, showcasing its generalizability. This approach offers a promising solution for enhancing LLM personalization by effectively transforming long, noisy user histories into informative and human-readable representations.

Paper Structure (56 sections, 5 equations, 10 figures, 18 tables)

Introduction
Methodology
Problem Statement
Reinforcement Learning from Prediction Feedback
Reward Computation
Training Process
Experimental Details
Dataset
Data Generation
Evaluation Metrics
Extrinsic Utility
Intrinsic Quality
Training Details
Baselines
Results
...and 41 more sections

Figures (10)

Figure 1: Overview of RLPF. Left: Training process of RLPF, in which future activity will be used towards reward computation. Right: We assess RLPF on unseen downstream prediction tasks to demonstrate its generalizability and adaptability.
Figure 2: RLPF summaries consistently demonstrate superior performance in Future Activity Prediction, surpassing both other summarization techniques and the full user context ("All Activities"), while significantly reducing the required context length. ZS-nano2: Gemini Nano-2 Zero-Shot; ZS-CP: Gemini Nano-2 with Crafted Prompts; ZS-Pro: Gemini Pro Zero-Shot.
Figure 3: Impact of Different Target Lengths on MovieLens 2015. Percentage changes are calculated relative to "No Length Reward" condition (no maximum length constraint). Data on the right axis pertains to AutoEval, while the left axis corresponds to the remaining tasks.
Figure 4: RLPF is robust with various prompts. Top: Evaluation metric with different prompts for Summarization, Bottom: Evaluation metric with different prompts for Prediction during reward computation. Prediction task: Future activity prediction on MovieLens 2015.
Figure 5: Example Summary Comparison on Amazon Books. There are duplicate parts (orange) in Zero-Shot summary, and hallucinations in RLAIF summary (red), while summary generated by RLPF adheres to the facts highlighted in green. Activity data statistics (out of 50 book review activities): Fiction: 34, Thrillers: 12, Written by woman: 20.
...and 5 more figures
