Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Teqi Hao, Xiaoyu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu

TL;DR

This paper addresses the difficulty of personalizing black-box LLMs whose parameters cannot be fine-tuned. It introduces Reflective Personalization Optimization (RPO), a generate-then-rewrite framework in which a base model produces a generic answer and an external reflection module rewrites it to reflect user preferences, guided by retrieved personalized context. Its central contributions are Structured Rewriting Trajectories, which make the personalization policy observable; a two-stage training pipeline (supervised fine-tuning followed by reinforcement learning with a progressive multi-context curriculum); and a model-agnostic, modular architecture that the authors report achieves state-of-the-art results on the LaMP personalization benchmark. The framework is reported to perform robustly across classification, regression, and generation tasks while preserving the content integrity of the base model's output, which suggests potential for practical deployment in user-centric generation scenarios. Overall, RPO offers a scalable, interpretable path to controllable personalization of black-box LLMs by explicitly modeling and learning the reasoning that turns a generic response into one aligned with a user's style.

Abstract

The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user's preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.
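The two-stage inference flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ProfileEntry`, `retrieve_context`, `build_rewrite_prompt`, and `personalize` are hypothetical names, the word-overlap retriever is a stand-in for whatever retriever the paper uses, and both models are passed in as plain callables so that the base model can remain a black box.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProfileEntry:
    """One item from the user's history (hypothetical schema)."""
    query: str
    response: str


def retrieve_context(query: str, profile: List[ProfileEntry], k: int = 2) -> List[ProfileEntry]:
    """Return the k history entries whose queries share the most words with `query`.

    Word overlap is an assumed stand-in retriever, chosen only to keep the sketch
    self-contained.
    """
    query_words = set(query.lower().split())

    def overlap(entry: ProfileEntry) -> int:
        return len(query_words & set(entry.query.lower().split()))

    return sorted(profile, key=overlap, reverse=True)[:k]


def build_rewrite_prompt(query: str, generic: str, context: List[ProfileEntry]) -> str:
    """Assemble the reflection module's input (assumed prompt layout)."""
    examples = "\n".join(f"- Q: {e.query}\n  A: {e.response}" for e in context)
    return (
        f"User history examples:\n{examples}\n\n"
        f"Query: {query}\n"
        f"Generic response: {generic}\n"
        "Rewrite the generic response to match the user's preferences."
    )


def personalize(
    query: str,
    profile: List[ProfileEntry],
    base_model: Callable[[str], str],
    reflection_model: Callable[[str], str],
    k: int = 2,
) -> str:
    """Generate-then-rewrite: content first, personalization second."""
    # Stage 1: the black-box base model sees only the query, never the profile.
    generic = base_model(query)
    # Stage 2: the external reflection module rewrites the generic response
    # using retrieved user context.
    context = retrieve_context(query, profile, k)
    return reflection_model(build_rewrite_prompt(query, generic, context))
```

Because the base model is only ever called with the raw query, swapping it for a different black-box model requires no change to the reflection step, which is the modularity the abstract emphasizes.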

Paper Structure

This paper contains 24 sections, 4 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: This diagram illustrates the workflow of RPO. The system first uses a base model to generate an initial, non-personalized response to the user query. The reflection module then integrates user profile information and refines the initial response through reasoning enhancement, producing output that aligns with the user's personalized needs.
  • Figure 2: This figure illustrates the training pipeline of the RPO framework, which encompasses three pivotal stages: (1) Data Preparation: The pipeline begins by generating a corpus of Structured Rewriting Trajectories. For each training instance, a powerful teacher model is prompted with a generic response and a user's historical example to externalize the latent reasoning process behind personalization. (2) SFT Stage: The reflection model is then trained on these trajectories via supervised fine-tuning. This stage aims to instill a foundational personalized reasoning policy by having the model learn the explicit, step-by-step logic demonstrated in the trajectories. (3) RL Stage: Finally, the model's policy is refined using reinforcement learning. This stage features a key innovation: a progressive multi-context curriculum. The model is trained on a varying number of user profile examples (from 2 to 6) to enhance its ability to generalize from diverse and noisy user histories. This process is optimized using the REINFORCE++_baseline method.
  • Figure 3: The figure illustrates how varying the number of shots influences RPO during the reinforcement learning phase. Results suggest that a progressive sampling strategy, which increases the shot count incrementally, leads to superior performance.
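
The progressive multi-context curriculum mentioned in the captions for Figures 2 and 3 could be scheduled along the following lines. This is a hedged sketch only: the captions state that the number of user profile examples ranges from 2 to 6 and that increasing it incrementally works best, but the equal-length phases, the function names `shots_for_step` and `sample_context`, and the uniform sampling of history items are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def shots_for_step(step: int, total_steps: int, min_shots: int = 2, max_shots: int = 6) -> int:
    """Number of profile examples to show at a given RL training step.

    Training is split into equal-length phases (an assumption), one per shot
    count, so the count grows from min_shots to max_shots as training proceeds.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    num_phases = max_shots - min_shots + 1
    phase = step * num_phases // total_steps  # integer phase index, 0-based
    return min_shots + phase


def sample_context(history: Sequence[str], num_shots: int, rng: random.Random) -> List[str]:
    """Draw up to num_shots distinct history items for one training prompt."""
    k = min(num_shots, len(history))
    return rng.sample(list(history), k)


if __name__ == "__main__":
    rng = random.Random(0)
    history = [f"example_{i}" for i in range(8)]
    for step in range(10):
        n = shots_for_step(step, total_steps=10)
        print(step, n, sample_context(history, n, rng))
```

With `total_steps=10` and the default bounds, the shot count per step is 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, so the reflection model sees short contexts early and longer, noisier ones later.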