Pearl: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers

Sheshera Mysore; Zhuoran Lu; Mengting Wan; Longqi Yang; Bahareh Sarrafzadeh; Steve Menezes; Tina Baghaee; Emmanuel Barajas Gonzalez; Jennifer Neville; Tara Safavi

TL;DR

Pearl personalizes LLM writing assistants by learning a generation-calibrated retriever that selects a subset of a user's historical documents to augment the prompt. It introduces a training-data selection method based on a differential likelihood computed with an auxiliary model, and a scale-calibrated KL objective that aligns retriever scores with downstream generation quality. Empirically, Pearl matches or surpasses strong baselines on two social media datasets (WorkSm and AITA), and its retriever scores can serve as a performance predictor, enabling selective revision of outputs that are likely to be weak. These results suggest practical gains for personalized writing tools and provide a framework for calibrated retrieval in user-specific generation tasks.
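
The training-data selection idea can be illustrated with a small sketch. It assumes the "differential likelihood" is the gain in an auxiliary model's log-likelihood of the user's target text when a historical document is added to the prompt; that reading, along with the function name `select_training_pairs`, the `min_gain` threshold, and the input layout, is an assumption made for illustration rather than a detail confirmed by this page.

```python
from typing import Dict, List, Tuple


def select_training_pairs(
    base_loglik: Dict[str, float],
    doc_loglik: Dict[str, Dict[str, float]],
    min_gain: float = 0.0,
) -> List[Tuple[str, str, float]]:
    """Pick requests that benefit from personalization and their best document.

    base_loglik[r]   : auxiliary-model log-likelihood of the target for request r
                       with no historical document in the prompt (assumed input).
    doc_loglik[r][d] : the same log-likelihood when document d is added.
    min_gain         : hypothetical threshold on the likelihood gain.
    """
    selected = []
    for request_id, per_doc in doc_loglik.items():
        if not per_doc:
            continue  # no candidate documents for this request
        # Differential likelihood: how much each document helps over no document.
        gains = {d: ll - base_loglik[request_id] for d, ll in per_doc.items()}
        best_doc = max(gains, key=gains.get)
        # Keep the request only if its best document gives a real improvement.
        if gains[best_doc] > min_gain:
            selected.append((request_id, best_doc, gains[best_doc]))
    return selected
```

In this sketch, a request whose best document does not raise the target likelihood is dropped from training, which mirrors the idea of keeping only requests that are likely to benefit from personalization.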

Abstract

Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author's communication style, specialized knowledge, and values. In this paper, we address this challenge by proposing Pearl, an LLM writing assistant personalized with a retriever that is trained to be generation-calibrated for personalization. Generation calibration ensures that our retriever selects historical user-authored documents to augment an LLM prompt such that they are likely to help an LLM generation better adhere to a user's preferences. We propose two key novelties for training such a retriever: (1) a training data selection method that identifies user requests likely to benefit from personalization and documents that provide that benefit; and (2) a scale-calibrating KL-divergence objective that ensures that our retriever scores remain proportional to the downstream generation quality obtained by using the document for personalized generation. In a series of holistic evaluations, we demonstrate the effectiveness of Pearl in generating long-form texts on multiple social media datasets. Finally, we demonstrate how a generation-calibrated retriever can double as a performance predictor: detecting low-quality retrieval and improving potentially under-performing outputs via revision with LLMs.
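
To make the role of scale calibration concrete, here is a minimal sketch of a KL-divergence loss between target scores and retriever scores. A plain softmax is invariant to adding a constant to every score, so it cannot constrain the absolute magnitude of the retriever's outputs. One known way to remove that invariance is to append a fixed "anchor" logit to both score lists before the softmax. Whether Pearl uses exactly this form is not stated on this page; the anchor construction and the names `scale_calibrated_kl` and `anchor` are assumptions for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def scale_calibrated_kl(
    retriever_scores: List[float],
    target_scores: List[float],
    anchor: float = 0.0,
) -> float:
    """KL(target || retriever) with a fixed anchor logit appended to both lists.

    target_scores would come from a measure of downstream generation quality
    (assumed input); the anchor pins the scale so that shifting all retriever
    scores by a constant is penalized.
    """
    p = softmax(target_scores + [anchor])
    q = softmax(retriever_scores + [anchor])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

With the anchor in place, retriever scores of `[6.0, 7.0]` against targets of `[1.0, 2.0]` give a positive loss, even though the two lists differ only by a constant shift and would be indistinguishable under an ordinary softmax.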

Paper Structure (27 sections, 2 equations, 9 figures, 14 tables, 1 algorithm)

Introduction
Related Work
Problem Definition
Proposed Approach
Training Data Setup
Training Data Selection
Retriever Optimization
System Details
Experiments
Experimental Setup
Generation Evaluation
Selective Revision with Pearl
Conclusion
Ethical and broader impact
Model Details
...and 12 more sections

Figures (9)

Figure 1: Pearl is a request-driven generation model that personalizes LLM outputs through retrieval augmentation with a generation-calibrated retriever.
Figure 2: To train the retriever, $f_{\textrm{retr}}$, an auxiliary language model is first used to identify historical requests that can be personalized and the best document to use for personalization ①. Then, $f_{\textrm{retr}}$ is trained on the selected data with a scale-calibrating loss function ②. Given an unseen request, $f_{\textrm{retr}}$ is used to select the best instances from historical texts for augmenting an LLM prompt for personalized generation ③. Our training results in a generation-calibrated retriever whose scores for documents are proportional to the quality of the LLM output.
Figure 3: Generation calibration of $f_{\textrm{retr}}$ allows us to use its predicted scores for performance prediction and to selectively revise potentially bad generations.
Figure 4: A qualitative example illustrating the effectiveness of Pearl on AITA: Given a request post $q_u$ describing an ambiguous interpersonal situation regarding sharing medical information, Pearl retrieves a historical user comment $d_u$ that demonstrates the user's characteristic tone and values, and generates a comment $t_u$ highly similar to the ground-truth user comment $t_u^*$. We bold qualitatively similar phrases about individual liberties and italicize phrases about self-care and mental health. All texts are abbreviated for space and provided in full in Appendix \ref{supp-additional-res}.
Figure 5: $f_\textrm{LLM}$ prompt used for selective revision given a Stage 1 draft for AITA.
...and 4 more figures
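
The selective revision described in the Figure 3 caption can be sketched as a simple gate on the retriever's top score. This is an illustrative outline only: the function name `selectively_revise`, the `threshold` value, and the `revise` callback (standing in for a second LLM call) are hypothetical and not taken from the paper.

```python
from typing import Callable


def selectively_revise(
    draft: str,
    top_retriever_score: float,
    threshold: float,
    revise: Callable[[str], str],
) -> str:
    """Revise a draft only when the retriever score predicts a weak generation.

    Because a generation-calibrated retriever's scores track output quality,
    a low top score is treated here as a signal that the draft may
    under-perform. The threshold is a hypothetical tuning parameter.
    """
    if top_retriever_score < threshold:
        return revise(draft)  # second pass, e.g. an LLM revision prompt
    return draft  # score is high enough; keep the Stage 1 draft
```

Gating on the score keeps the extra revision call limited to the drafts that are predicted to need it.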
