Panza: Design and Analysis of a Fully-Local Personalized Text Writing Assistant

Armand Nicolicioiu; Eugenia Iofinova; Andrej Jovanovic; Eldar Kurtic; Mahdi Nikdan; Andrei Panferov; Ilia Markov; Nir Shavit; Dan Alistarh

TL;DR

Panza presents a fully-local, privacy-preserving approach to personalizing long-form email writing by combining Reverse Instructions-based fine-tuning with Retrieval-Augmented Generation. The method enables on-device training using parameter-efficient adapters (RoSA/LoRA) and memory-aware techniques, including 4-bit quantization, to run on commodity GPUs and even CPU-only systems. Evaluations across multiple open-source backbones show that fine-tuning with personalized data substantially improves style fidelity and usefulness over baselines, with RAFT-influenced training enhancing content-aware retrieval. A key finding is that surprisingly small personal datasets (as few as 25–100 emails) can yield convincingly personalized models, highlighting both the feasibility and potential abuse risk of such capabilities. The work demonstrates practical on-device deployment via a Gmail extension and discusses trade-offs between BLEU/ROUGE and MAUVE in capturing paraphrasing quality and stylistic similarity, supported by human studies and an explicit privacy/safety discussion.

Abstract

The availability of powerful open-source large language models (LLMs) opens exciting use cases, such as using personal data to fine-tune these models to imitate a user's unique writing style. Two key requirements for such assistants are personalization, in the sense that the assistant should recognizably reflect the user's own writing style, and privacy, since users may justifiably be wary of uploading extremely personal data, such as their email archive, to a third-party service. In this paper, we present a new design and evaluation for such an automated assistant, for the specific use case of email generation, which we call Panza. Panza's personalization features are based on a combination of fine-tuning using a variant of the Reverse Instructions technique together with Retrieval-Augmented Generation (RAG). We demonstrate that this combination allows us to fine-tune an LLM to reflect a user's writing style using limited data, while executing on extremely limited resources, e.g. on a free Google Colab instance. Our key methodological contribution is the first detailed study of evaluation metrics for this personalized writing task, and of how different choices of system components (the use of RAG and of different fine-tuning approaches) impact the system's performance. Additionally, we demonstrate that very little data (under 100 email samples) is sufficient to create models that convincingly imitate humans. This finding showcases a previously unknown attack vector in language models: access to a small number of writing samples can allow a bad actor to cheaply create generative models that imitate a target's writing style. We are releasing the full Panza code as well as three new email datasets licensed for research use at https://github.com/IST-DASLab/PanzaMail.
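
To make the two data products named in the abstract concrete, the following is a minimal sketch of how a set of user emails might be turned into (instruction, email) fine-tuning pairs in the spirit of Reverse Instructions, and how past emails might be retrieved to build a RAG prompt. It is not the Panza implementation. The function names, the prompt layout, and the sample emails are all hypothetical; `toy_summarize` is a placeholder for the LLM that would normally write the instruction, and bag-of-words cosine similarity stands in for a learned embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical sample data, standing in for a user's sent-mail archive.
emails = [
    "Hi Sam. The quarterly budget review is moved to Friday.",
    "Hello team. Lunch is on me this Thursday at noon.",
    "Hi Sam. Please send the budget spreadsheet before the review.",
]


def tokenize(text):
    """Lowercase word tokens (assumption: a toy substitute for embeddings)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_summarize(email):
    """Placeholder for an LLM call that writes an instruction for the email."""
    first_sentence = email.split(".")[0].strip()
    return f"Write an email about: {first_sentence}"


def build_reverse_instruction_pairs(user_emails, summarize=toy_summarize):
    """Pair each real email with a synthetic instruction that could have produced it."""
    return [{"instruction": summarize(e), "response": e} for e in user_emails]


def retrieve(query, user_emails, k=2):
    """Return the k past emails most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        user_emails,
        key=lambda e: cosine(q, Counter(tokenize(e))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction, user_emails, k=2):
    """Assemble a RAG-style prompt: retrieved emails first, then the instruction."""
    context = "\n---\n".join(retrieve(instruction, user_emails, k))
    return f"Previous emails by the user:\n{context}\n\nInstruction: {instruction}\n"


pairs = build_reverse_instruction_pairs(emails)
prompt = build_prompt("Ask Sam about the budget review", emails)
```

In this sketch, `pairs` plays the role of the fine-tuning dataset and `prompt` is what a fine-tuned model would receive at inference time. A real system would replace the summarizer with an instruction-following LLM and the similarity function with vector search over embeddings.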

Paper Structure (56 sections, 11 figures, 21 tables)

Introduction
Related Work
Method
Finetuning via Reverse Instructions
Local Fine-Tuning
Finetuning on a single GPU
Inference
Evaluation Protocol
Datasets
Metrics
Phrasing quality
User-specific knowledge
General knowledge
Style transfer
Discussion
...and 41 more sections

Figures (11)

Figure 1: Panza's overall design. Given a set of emails produced by the user, we produce both a finetuning dataset (using Reverse Instructions) and a Retrieval-Augmented Generation (RAG) database. The base model is first fine-tuned and then served with RAG.
Figure 2: Average performance of finetuning-based methods compared to pretrained baselines (Llama3-8B-Instruct with and without RAG at inference). All versions of finetuning (FFT, RoSA and LoRA) outperform the baselines; RoSA and FFT outperform LoRA.
Figure 3: Comparison between instruction-only fine-tuning and Retrieval-Augmented Fine-Tuning (RAFT) for RoSA on Llama3-8B, averaged across users.
Figure 4: Style comparison between models trained for different users. Each model, trained for a particular user, is used to generate emails for unseen instructions of all the users. We measure the MAUVE score between the generations and the original emails written by the user.
Figure 5: BLEU and MAUVE scores of Panza models trained on smaller subsets of the training data, relative to the maximum score attained for the dataset. 0 emails corresponds to the un-finetuned Llama3-8B-Instruct model. Scores are averaged across seven users.
...and 6 more figures
