Disentangling Preference Representation and Text Generation for Efficient Individual Preference Alignment
Jianfei Zhang, Jun Bai, Bei Li, Yanmeng Wang, Rumei Li, Chenghua Lin, Wenge Rong
TL;DR
This work tackles the challenge of efficiently aligning large language models with individual user preferences. It introduces two components: Contrastive Language–Latent Pretraining (CLaP), which extends decoder-only LLMs with a probabilistic latent variable $z$ via a latent encoder $q(z|x,y)$ and a latent adapter $p(y|x,z)$ to disentangle preference representation from text generation, and Latent Direct Preference Optimization (Latent DPO), which learns a personalized latent encoder $p_{\theta}(z|x)$ using offline responses and latent rewards. By applying DPO at the latent level rather than to the full model, the method reduces per-user training time by 80–90% while delivering alignment quality competitive with parameter-efficient fine-tuning (PEFT) baselines based on LoRA or P-Tuning. Across IMDB, DailyDialog, and TL;DR summarization tasks, Latent DPO demonstrates strong personalized performance and clear efficiency gains, and additional experiments on Llama3-8B show consistent trends. This work offers a scalable solution for individual preference alignment, enabling large-scale customization without prohibitive computational cost.
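To make the idea of applying DPO at the latent level concrete, the following is a minimal sketch, not the paper's implementation. It assumes that the personalized encoder $p_{\theta}(z|x)$ and a frozen reference encoder are diagonal Gaussians, that each preferred and rejected response has already been mapped to a latent vector (for example by $q(z|x,y)$), and that the standard DPO loss is applied to latent log-densities instead of token log-probabilities. The function names, the Gaussian parameterization, and the value of `beta` are all hypothetical; the paper's exact objective and its use of latent rewards may differ.

```python
import math


def gaussian_log_prob(z, mean, log_std):
    """Log-density of z under a diagonal Gaussian N(mean, exp(log_std)^2)."""
    total = 0.0
    for zi, mi, lsi in zip(z, mean, log_std):
        var = math.exp(2.0 * lsi)
        total += -0.5 * ((zi - mi) ** 2 / var + 2.0 * lsi + math.log(2.0 * math.pi))
    return total


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def latent_dpo_loss(z_chosen, z_rejected, policy, reference, beta=0.1):
    """Standard DPO loss evaluated on latent log-densities (an assumption).

    policy and reference are (mean, log_std) pairs describing the
    trainable and frozen latent encoders for a single prompt x.
    """
    policy_margin = (gaussian_log_prob(z_chosen, *policy)
                     - gaussian_log_prob(z_rejected, *policy))
    reference_margin = (gaussian_log_prob(z_chosen, *reference)
                        - gaussian_log_prob(z_rejected, *reference))
    logits = beta * (policy_margin - reference_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


# Hypothetical 2-dimensional latents for a preferred and a rejected response.
z_w, z_l = [1.0, 0.0], [-1.0, 0.0]
reference = ([0.0, 0.0], [0.0, 0.0])
shifted_policy = ([0.5, 0.0], [0.0, 0.0])  # mean moved toward the preferred latent

loss_at_reference = latent_dpo_loss(z_w, z_l, reference, reference)
loss_after_shift = latent_dpo_loss(z_w, z_l, shifted_policy, reference)
print(loss_at_reference, loss_after_shift)
```

When the policy equals the reference, the loss is $\log 2$; moving the policy mean toward the preferred latent lowers it. Under these assumptions only the small latent encoder would receive gradients, which is the intuition behind the reported per-user training-time savings.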
Abstract
Aligning Large Language Models (LLMs) with general human preferences has proven crucial for improving the quality of interaction between LLMs and humans. However, human values are inherently diverse across individuals, so aligning LLMs solely with general preferences is insufficient. To address this, personalizing LLMs according to individual feedback has emerged as a promising solution. Nonetheless, this approach poses challenges for the efficiency of alignment algorithms. In this work, we introduce a flexible paradigm for individual preference alignment. Our method fundamentally improves efficiency by disentangling preference representation from text generation in LLMs. We validate our approach across multiple text generation tasks and demonstrate that it achieves alignment quality comparable to or better than PEFT-based methods, while reducing the additional training time for each new individual preference by $80\%$ to $90\%$ relative to those methods.
