Pearl: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers
Sheshera Mysore, Zhuoran Lu, Mengting Wan, Longqi Yang, Bahareh Sarrafzadeh, Steve Menezes, Tina Baghaee, Emmanuel Barajas Gonzalez, Jennifer Neville, Tara Safavi
TL;DR
Pearl tackles personalization in LLM writing assistants by learning a generation-calibrated retriever that selects a subset of historic user documents to augment prompts. It introduces a training-data selection method based on a differential likelihood from an auxiliary model and a scale-calibrated KL objective to align retriever scores with downstream generation quality. Empirically, Pearl matches or surpasses strong baselines on two social media datasets (WorkSm and AITA), and its retriever scores can function as a performance predictor, enabling selective revision to improve outputs. These results suggest practical gains for personalized writing tools and provide a framework for calibrated retrieval in user-specific generation tasks.
Abstract
Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author's communication style, specialized knowledge, and values. In this paper, we address this challenge by proposing Pearl, a LLM writing assistant personalized with a retriever that is trained to be generation-calibrated for personalization. Generation calibration ensures that our retriever selects historic user authored documents to augment an LLM prompt such that they are likely to help an LLM generation better adhere to a users' preferences. We propose two key novelties for training such a retriever: (1) A training data selection method that identifies user requests likely to benefit from personalization and documents that provide that benefit; and (2) A scale-calibrating KL-divergence objective that ensures that our retriever scores remain proportional to the downstream generation quality from using the document for personalized generation. In a series of holistic evaluations, we demonstrate the effectiveness of Pearl in generating long-form texts on multiple social media datasets. Finally, we demonstrate how a generation-calibrated retriever can double as a performance predictor -- detecting low quality retrieval, and improving potentially under-performing outputs via revision with LLMs.
