UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches

Chao Wang; Neo Wu; Lin Ning; Jiaxing Wu; Luyang Liu; Jun Xie; Shawn O'Banion; Bradley Green

TL;DR

This work introduces UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based user summarization approaches. It offers two key components: a reference-free summary quality metric, and a novel, robust summarization method that leverages a time-hierarchical summarizer and a self-critique verifier to produce high-quality summaries while eliminating hallucination.

Abstract

Large language models (LLMs) have shown remarkable capabilities in generating user summaries from long lists of raw user activity data. These summaries capture essential user information such as preferences and interests, and are therefore invaluable for LLM-based personalization applications such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and the cost and time required for human evaluation. To address these challenges, we introduce UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. The framework offers two key components: (1) A reference-free summary quality metric. We show that this metric is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp, and Amazon Review). (2) A novel, robust summarization method that leverages a time-hierarchical summarizer and a self-critique verifier to produce high-quality summaries while eliminating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.

Paper Structure (23 sections, 5 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Works
UserSumBench Framework
Benchmark Metrics
Quality Metric
Instruction Following Metric
Information Density Metric
Hierarchy-Critique Summary Generation
Evaluation
Validating Benchmark Metrics
Datasets and Evaluation Tasks
Quality Metric vs. Human Ratings
Evaluating Summarization Approaches
Conclusion and Future Work
Prompt Examples
...and 8 more sections

Figures (5)

Figure 1: Evaluating summary quality through future activity prediction tasks, where LLMs predict the most likely user queries based on generated summaries of past activities.
Figure 2: Comparison of next product prediction accuracy across different contexts in the Amazon Review dataset, contrasting performance using raw timelines versus summarized data.
Figure 3: Comparison of summarization approaches: (1) Single-step summarization approach, where a summary is generated directly from the user's activity history; (2) Time-hierarchical and self-critique summarization approach, which involves segmenting the user's activity history over time, summarizing each segment, and iteratively refining the summaries before combining them into a final summary.
Figure 4: Diagram of the iterative summarization refinement process, where the LLM Summarizer generates an initial segment summary, which is then iteratively critiqued and refined by the LLM Verifier until an optimized summary is achieved or a specified iteration threshold is met.
Figure 5: Comparison of different models (Gemini 1.5 Pro, GPT-4o, and Claude 3 Haiku) on various metrics across the MovieLens, Yelp, and Amazon Review datasets: (1) Quality Metric, (2) Instruction Following Metric, and (3) Information Density Metric.
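
The process described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is a minimal Python illustration, not the authors' implementation. The function names (`split_by_time`, `refine_segment`, `hierarchical_summary`), the use of consecutive fixed-size chunks as a stand-in for time-based segmentation, the default of three refinement rounds, and the convention that the verifier returns `None` when it has no objections are all assumptions made for this example. The summarizer, verifier, and combiner are passed in as plain callables that would wrap LLM calls in practice.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interfaces; in practice each would wrap an LLM call.
# Summarizer: (segment activities, optional critique) -> segment summary
Summarizer = Callable[[Sequence[str], Optional[str]], str]
# Verifier: (segment activities, summary) -> critique text, or None if acceptable
Verifier = Callable[[Sequence[str], str], Optional[str]]
# Combiner: (segment summaries) -> final user summary
Combiner = Callable[[Sequence[str]], str]


def split_by_time(activities: Sequence[str], segment_size: int) -> List[List[str]]:
    """Split a chronologically ordered activity list into consecutive segments.

    Assumption: fixed-size chunks approximate segmentation over time.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        list(activities[i:i + segment_size])
        for i in range(0, len(activities), segment_size)
    ]


def refine_segment(
    segment: Sequence[str],
    summarize: Summarizer,
    verify: Verifier,
    max_iters: int = 3,
) -> str:
    """Summarize one segment, then critique and revise until accepted
    or until the iteration threshold is reached."""
    summary = summarize(segment, None)
    for _ in range(max_iters):
        critique = verify(segment, summary)
        if critique is None:  # verifier has no further objections
            break
        summary = summarize(segment, critique)
    return summary


def hierarchical_summary(
    activities: Sequence[str],
    summarize: Summarizer,
    verify: Verifier,
    combine: Combiner,
    segment_size: int = 50,
    max_iters: int = 3,
) -> str:
    """Refine a summary per segment, then merge them into a final summary."""
    segments = split_by_time(activities, segment_size)
    refined = [refine_segment(s, summarize, verify, max_iters) for s in segments]
    return combine(refined)
```

Keeping the three roles as injected callables means the control flow can be exercised with simple stubs before any model is attached, and the iteration cap bounds the number of verifier calls per segment.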
