LongLaMP: A Benchmark for Personalized Long-form Text Generation
Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A. Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Chien Van Nguyen, Thien Huu Nguyen, Hamed Zamani
TL;DR
LongLaMP is a benchmark for personalized long-text generation that spans four tasks and two evaluation settings, enabling systematic evaluation of personalization in long-form content. It uses a retrieval-augmented generation framework that injects user profile information through a query construction function and a prompt construction function, and it compares BM25 and Contriever as retrievers. Zero-shot and fine-tuned experiments with GPT-3.5, LLaMA2, and FlanT5-base show consistent personalization benefits across emails, abstracts, reviews, and topic writing, with substantial gains in ROUGE-1, ROUGE-L, and METEOR. The work highlights the practical significance of personalization for long-form generation and releases LongLaMP as an open-source benchmark for future research and development.
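The retrieval-augmented setup summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the names `query_fn`, `retrieve`, and `prompt_fn`, the prompt template, the tokenizer, and the BM25 parameters (`k1=1.5`, `b=0.75`, with a non-negative IDF variant) are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simple lowercase alphanumeric tokenizer (an assumption for this sketch).
    return re.findall(r"[a-z0-9]+", text.lower())


def query_fn(task_input):
    # Hypothetical query construction: use the task input itself as the query.
    return tokenize(task_input)


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    # Score every profile document against the query with Okapi BM25.
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = (sum(len(t) for t in tokenized) / n) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumed here, as used by some BM25 implementations).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(task_input, profile, k=2):
    # Return the k profile entries that score highest for the constructed query.
    scores = bm25_scores(query_fn(task_input), profile)
    ranked = sorted(range(len(profile)), key=lambda i: scores[i], reverse=True)
    return [profile[i] for i in ranked[:k]]


def prompt_fn(task_input, retrieved):
    # Hypothetical prompt construction: prepend retrieved user text to the task.
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return (
        f"Past writing by this user:\n{context}\n\n"
        f"Task: {task_input}\n"
        "Write the response in this user's style."
    )


profile = [
    "Great hiking boots, very comfortable on long trails",
    "The blender is loud but works well",
    "These trail running shoes have good grip",
]
task = "Write a review of hiking boots for mountain trails"
prompt = prompt_fn(task, retrieve(task, profile, k=1))
```

The resulting `prompt` would then be passed to a language model. Swapping `bm25_scores` for a dense retriever such as Contriever would change only the scoring step, leaving the query and prompt construction untouched.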
Abstract
Long-text generation is seemingly ubiquitous in real-world applications of large language models, such as generating an email or writing a review. Despite the fundamental importance and prevalence of long-text generation in many practical applications, existing work on personalized generation has focused on very short text. To address this limitation, we study the problem of personalized long-text generation, that is, generating long text that is personalized for a specific user while being practically useful for the vast majority of real-world applications that naturally require longer text. In this work, we demonstrate the importance of user-specific personalization for long-text generation tasks and develop the Long-text Language Model Personalization (LongLaMP) Benchmark. LongLaMP provides a comprehensive and diverse evaluation framework for personalized long-text generation. Extensive experiments on LongLaMP in zero-shot and fine-tuned settings demonstrate the effectiveness of the proposed benchmark and its utility for developing and evaluating techniques for personalized long-text generation, and the results highlight the importance of personalization across a wide variety of long-text generation tasks. Finally, we release the benchmark for others to use for this important problem.
