Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing

Zeyu Zhang, Xiangxiang Dai, Ziyi Han, Xutong Liu, John C. S. Lui

Abstract

Large language models (LLMs) are typically governed by post-training alignment (e.g., RLHF or DPO), which yields a largely static policy during deployment and inference. However, real-world safety is a full-lifecycle problem: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. This motivates inference-time governance that steers behavior without costly retraining. To address this, we introduce the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus clustering mechanism: it pools data only within the intersection of utility and safety similarity graphs, effectively preventing unsafe generalization across semantically proximal but risk-divergent contexts. Our theoretical analysis yields a sublinear regret guarantee, demonstrating near-optimal performance of CCLUB. Extensive experiments validate that CCLUB outperforms strong baselines, achieving a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.
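The consensus clustering idea in the abstract (pool data only where the utility and safety similarity graphs both connect two contexts) can be illustrated with a small sketch. This is not the paper's algorithm: the Euclidean threshold rule, the thresholds `eps_u` and `eps_s`, the toy vectors, and the function names `threshold_graph` and `consensus_clusters` are all hypothetical choices made for illustration.

```python
import numpy as np


def threshold_graph(vectors, eps):
    """Boolean adjacency: edge (i, j) iff ||v_i - v_j|| <= eps (assumed rule)."""
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    adj = dist <= eps
    np.fill_diagonal(adj, False)  # no self-loops
    return adj


def consensus_clusters(utility_vecs, safety_vecs, eps_u, eps_s):
    """Label connected components of the intersection of the two graphs."""
    # An edge survives only if BOTH the utility and the safety graph contain it.
    adj = threshold_graph(utility_vecs, eps_u) & threshold_graph(safety_vecs, eps_s)
    n = adj.shape[0]
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:  # depth-first search over the consensus graph
            node = stack.pop()
            for nb in np.flatnonzero(adj[node]):
                if labels[nb] == -1:
                    labels[nb] = cluster
                    stack.append(nb)
        cluster += 1
    return labels


# Toy data (hypothetical): contexts 0, 1, 2 look alike in utility space,
# but context 2 has a very different safety profile.
utility = np.array([[1.0, 0.0], [1.05, 0.0], [0.95, 0.05], [0.0, 1.0]])
safety = np.array([[0.2, 0.8], [0.25, 0.8], [0.9, 0.1], [0.9, 0.15]])

labels = consensus_clusters(utility, safety, eps_u=0.2, eps_s=0.2)
print(labels)  # [0, 0, 1, 2]
```

In this toy example, clustering on utility alone would merge contexts 0, 1, and 2, while the intersection keeps context 2 separate because its safety vector diverges. That is the sense in which the consensus rule is conservative: it can only remove edges, never add them.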

Paper Structure (47 sections, 16 theorems, 96 equations, 8 figures, 6 tables, 1 algorithm)

Key Result

Theorem 4.4

Suppose the finite-arm, prototype-process, and clustered-linear assumptions hold (source labels: assump:finite_arm, assump:prototype_process, assump:clustered_linear). Then the cumulative regret of CCLUB satisfies a two-term bound: the first term captures the cost of cluster identification, and the second matches the cluster-level regret. The detailed proof is given in the theorem labeled thm:rccb-regret.
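For orientation, the standard notion of cumulative regret in contextual bandits can be written as below. This is the textbook definition under assumed notation ($x_t$ for the context at round $t$, $a_t$ for the system prompt chosen at that round, $r(x, a)$ for the expected reward, $\mathcal{A}$ for the finite arm set); the paper's own definition and symbols may differ.

$$
R(T) = \sum_{t=1}^{T} \Big( \max_{a \in \mathcal{A}} r(x_t, a) - r(x_t, a_t) \Big),
\qquad
\text{sublinear regret:}\quad \lim_{T \to \infty} \frac{R(T)}{T} = 0 .
$$

Under this reading, a sublinear bound means the average per-round gap to the best prompt vanishes as $T$ grows.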

Figures (8)

  • Figure 1: Safety–utility trade-offs as weight $w$ varies.
  • Figure 2: Adding safety features improves separability (AUC $\uparrow$).
  • Figure 3: Overview of the CCLUB Framework for Adaptive Social Alignment. The system integrates an offline safety-aware initialization phase with an online consensus clustering mechanism to route prompts dynamically.
  • Figure 4: Performance comparison on online cumulative reward and offline deployment suboptimality gap.
  • Figure 5: Average reward vs. inference-time preference weight $w$ under different offline data ratios (warm-start at $w_{\mathrm{train}}=0.5$).
  • ...and 3 more figures

Theorems & Definitions (35)

  • Theorem 4.4: Regret Bound of CCLUB
  • Remark 4.5: Warm-start influence
  • Remark 8.1: Interpreting the identification term
  • Lemma 8.2: Concat gap implies objective gap
  • proof
  • Lemma 8.3: Anytime-Valid Coverage of Prototypes
  • proof
  • Lemma 8.4: Empirical Bernstein Bound
  • Remark 8.5: Unknown $p_{\min}$
  • Lemma 8.6: Design-matrix eigenvalue growth under uniform exploration
  • ...and 25 more