LongLaMP: A Benchmark for Personalized Long-form Text Generation

Ishita Kumar; Snigdha Viswanathan; Sushrita Yerra; Alireza Salemi; Ryan A. Rossi; Franck Dernoncourt; Hanieh Deilamsalehy; Xiang Chen; Ruiyi Zhang; Shubham Agarwal; Nedim Lipka; Chien Van Nguyen; Thien Huu Nguyen; Hamed Zamani

TL;DR

LongLaMP is a benchmark for personalized long-text generation that spans four tasks (email completion, abstract generation, review writing, and topic writing) and two evaluation settings (user and temporal), enabling systematic evaluation of personalization in long-form content. It uses a retrieval-augmented generation framework that injects user profile information through query and prompt construction functions, and it compares BM25 and Contriever as retrievers. Zero-shot and fine-tuned experiments with GPT-3.5, LLaMA2, and FlanT5-base show consistent personalization benefits across all four tasks, with substantial gains in ROUGE-1, ROUGE-L, and METEOR. The work highlights the practical significance of personalization for long-form generation and releases LongLaMP openly for future research and development.
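The retrieval-augmented flow summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `phi_q`, `phi_p`, `bm25_scores`, and `retrieve`, the prompt template, the toy profile, and the BM25 parameters `k1` and `b` are all hypothetical, the scorer is a simplified BM25 variant written from scratch, and the final call to a language model is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real systems use better ones).
    return text.lower().split()


def phi_q(x):
    # Hypothetical query construction function: use the input text as the query.
    return x


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Simplified BM25: score every profile document against the query.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency of each term across the profile.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, profile, k=2):
    # Return the k highest-scoring profile documents, best first.
    scores = bm25_scores(query, profile)
    ranked = sorted(range(len(profile)), key=lambda i: scores[i], reverse=True)
    return [profile[i] for i in ranked[:k]]


def phi_p(x, retrieved):
    # Hypothetical prompt construction function: prepend retrieved documents.
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Past writing by this user:\n{context}\n\nTask: {x}"


# Toy user profile (invented for illustration only).
profile = [
    "The battery life of this laptop is great for travel.",
    "This novel starts slowly but the ending is rewarding.",
    "The laptop keyboard feels solid and the screen is sharp.",
]
x = "Write a review of a new laptop"
retrieved = retrieve(phi_q(x), profile, k=2)
prompt = phi_p(x, retrieved)  # this string would be passed to the language model
```

Here the two laptop-related documents are retrieved and placed in the prompt ahead of the task, while the unrelated book review is left out. Swapping `retrieve` for a dense retriever such as Contriever would change only the scoring step, not the surrounding structure.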

Abstract

Long-text generation is seemingly ubiquitous in real-world applications of large language models, such as generating an email or writing a review. Despite its fundamental importance and prevalence in practical applications, existing work on personalized generation has focused on very short text. To overcome this limitation, we study the problem of personalized long-text generation, that is, generating long text that is personalized for a specific user, as required by the many real-world applications that naturally call for longer text. In this work, we demonstrate the importance of user-specific personalization for long-text generation tasks and develop the Long-text Language Model Personalization (LongLaMP) Benchmark. LongLaMP provides a comprehensive and diverse evaluation framework for personalized long-text generation. Extensive experiments on LongLaMP in zero-shot and fine-tuned settings demonstrate the effectiveness of the proposed benchmark and its utility for developing and evaluating personalization techniques, and the results highlight the importance of personalization across a wide variety of long-text generation tasks. Finally, we release the benchmark for others to use on this important problem.

Paper Structure (37 sections, 10 figures, 19 tables)

Introduction
LongLaMP Benchmark
Problem Formulation
The LongLaMP Benchmark
LongLaMP-1: Personalized Email Completion.
LongLaMP-2: Personalized Abstract Generation.
LongLaMP-3: Personalized Review Writing.
LongLaMP-4: Personalized Topic Writing.
Dataset Splits and Evaluation
User Setting:
Temporal Setting:
Evaluation:
Framework
Experiments
Experimental Setup
...and 22 more sections

Figures (10)

Figure 1: Overview of the personalized long-text generation framework. To generate personalized text for a specific user $i$, the user provides input text $x$, and the user's documents (e.g., review text) and attributes (e.g., ratings) are supplied to the retrieval model to better personalize the generated text. The output is the long text generated for user $i$ and input $x$, personalized in style and content using that user's previous documents and attribute information. Note that $\phi_q$ and $\phi_p$ are the query and prompt construction functions.
Figure 2: The relationship between number $k$ of retrieved profiles.
Figure 3: Personalized email completion task schema. The $input$ represents the input prompt containing the title and part of the email. The $output$ represents the email content. The $profile$ section captures previous user-authored emails.
Figure 4: Personalized abstract generation task schema. The input is the generation prompt for the user, and the output is the ground-truth text for that user's input. The profile (e.g., the set of text documents and profile information for that user) is a possibly large set of text documents used by our retrieval model to generate personalized abstracts.
Figure 5: Structure of the Amazon Product Review dataset
...and 5 more figures
