CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen

TL;DR

CoSteer is a collaborative framework that enables tuning-free, real-time personalization via decoding-time adaptation. It generates high-quality personalized content while remaining both effective and computationally efficient.

Abstract

Personalization has become crucial for adapting models to the diverse and evolving needs of users across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment within a single model, they struggle to achieve both real-time and high-quality personalization under the resource and privacy constraints of personal devices. To address this challenge, we propose CoSteer, a collaborative framework that enables tuning-free, real-time personalization via decoding-time adaptation. By leveraging logit differences between context-aware and context-agnostic local small models, CoSteer steers cloud-based large models, ensuring effective personalization while preserving the large model's capabilities. Personalization is handled locally, with only final tokens sent to the cloud, maintaining both user context and system efficiency. Through extensive experiments across a wide range of tasks, we demonstrate that CoSteer generates high-quality personalized content, ensuring both effectiveness and computational efficiency. Our results highlight its robustness across models and environments, confirming its practical applicability in real-world scenarios.
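The core idea in the abstract, steering a large model with the logit difference between a context-aware and a context-agnostic small model, can be illustrated with a toy sketch. This is a deliberately simplified, single-step additive version and not the paper's actual procedure: the function names, the `strength` parameter, and the toy logits are all assumptions made for illustration.

```python
def steer_logits(large_logits, small_ctx_logits, small_plain_logits, strength=1.0):
    """Shift the large model's logits by the small models' delta.

    delta = (small model WITH personal context) - (small model WITHOUT it).
    `strength` is a hypothetical scaling knob, not a parameter from the paper.
    """
    return [
        big + strength * (ctx - plain)
        for big, ctx, plain in zip(large_logits, small_ctx_logits, small_plain_logits)
    ]


def pick_token(logits):
    """Greedy choice: index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy 4-token vocabulary with made-up numbers.
large = [2.0, 1.5, 0.5, 0.0]        # cloud LLM, no personal context
small_ctx = [0.5, 2.0, 0.0, 0.0]    # local small model with personal context
small_plain = [0.5, 0.5, 0.0, 0.0]  # local small model without it

steered = steer_logits(large, small_ctx, small_plain)
unsteered_choice = pick_token(large)
steered_choice = pick_token(steered)
```

In this toy case the personal context raises the small model's preference for token 1, and adding that delta is enough to change the large model's greedy choice from token 0 to token 1. Only the chosen token would then need to leave the device, which matches the abstract's description of personalization being handled locally.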

Paper Structure

This paper contains 73 sections, 19 equations, 8 figures, 14 tables, 1 algorithm.

Figures (8)

  • Figure 1: Schematic illustration of the CoSteer framework. (a) Task scenario: a user poses a question that may require access to local personal context (e.g., user profile, interaction history). (b) Limitations of small, locally deployed language models: direct inference with limited model capability leads to suboptimal generation quality. (c) Challenges of cloud-based LLMs: despite strong generalization, LLMs that cannot access local personal context produce misaligned or contextually disconnected outputs. (d) CoSteer: optimizes LLM predictions through local delta steering, balancing the LLM’s broad knowledge with user-specific information.
  • Figure 2: Effect of $\alpha$, $\beta$, $\eta$, $\lambda$, and iteration step $T$ on the Abstract Generation dataset using Qwen 7B-1.5B. Metric: ROUGE-L.
  • Figure 3: Detailed breakdown of wall-clock time per token. The system employs an asynchronous pipelining strategy: local SLM inference (Stream 2) runs concurrently with the network/cloud path (Stream 1). Since the local inference time ($\approx 30$ ms) is typically shorter than the network round-trip ($\approx 40$ ms) plus cloud inference ($\approx 40$ ms), the local computational burden is effectively masked and does not affect the critical-path latency. The primary bottlenecks are thus network transmission (b & e) and the local FTRL optimization (d).
  • Figure 4: An example from the Cogenesis dataset.
  • Figure 5: Examples from the Longlamp dataset.
  • ...and 3 more figures
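
The latency argument in the Figure 3 caption can be checked with simple arithmetic. The sketch below uses only the approximate timings quoted in that caption (30 ms local inference, 40 ms network round-trip, 40 ms cloud inference). It leaves out the local FTRL optimization step, whose duration the caption does not give, and the function name and its parameters are hypothetical.

```python
def per_token_latency_ms(local_ms, network_ms, cloud_ms, overlap=True):
    """Per-token latency for the two streams described in the Figure 3 caption.

    Stream 1 is the network round-trip plus cloud inference; Stream 2 is the
    local small-model inference. With overlap, the slower stream sets the
    latency; without it, the two simply add up.
    The FTRL optimization step is NOT modeled (its duration is not stated).
    """
    remote_ms = network_ms + cloud_ms
    if overlap:
        return max(local_ms, remote_ms)
    return local_ms + remote_ms


# Approximate values quoted in the caption.
overlapped_ms = per_token_latency_ms(30, 40, 40)
sequential_ms = per_token_latency_ms(30, 40, 40, overlap=False)
hidden_ms = sequential_ms - overlapped_ms
```

With these numbers the remote path takes about 80 ms and the local path about 30 ms, so running them concurrently gives roughly 80 ms per token instead of 110 ms: the full 30 ms of local inference is hidden behind the network and cloud path, as the caption states.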