Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing

Zeyu Zhang; Xiangxiang Dai; Ziyi Han; Xutong Liu; John C. S. Lui

Abstract

Large language models (LLMs) are typically governed by post-training alignment (e.g., RLHF or DPO), which yields a largely static policy during deployment and inference. However, real-world safety is a full-lifecycle problem: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. This motivates inference-time governance that steers behavior without costly retraining. To address this, we introduce the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus clustering mechanism: it pools data only within the intersection of utility and safety similarity graphs, effectively preventing unsafe generalization across semantically proximal but risk-divergent contexts. Our theoretical analysis yields a sublinear regret guarantee, demonstrating near-optimal performance of CCLUB. Extensive experiments validate that CCLUB outperforms strong baselines, achieving a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.
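The conservative pooling idea in the abstract can be illustrated with a minimal sketch. It assumes that each similarity graph is an undirected edge list over context prototypes and that data is pooled over the connected component of the consensus (intersection) graph; the abstract does not specify either detail, and the function names `consensus_edges` and `pooling_set` are hypothetical.

```python
from collections import deque


def consensus_edges(utility_edges, safety_edges):
    """Keep an undirected edge only if BOTH graphs contain it."""
    def normalize(edges):
        # frozenset makes (a, b) and (b, a) the same edge; self-loops are dropped.
        return {frozenset(e) for e in edges if len(set(e)) == 2}
    return normalize(utility_edges) & normalize(safety_edges)


def pooling_set(node, edges):
    """Nodes reachable from `node` in the consensus graph (breadth-first search).

    Assumption: pooling over connected components, as in clustering-of-bandits
    methods; the paper's exact pooling rule may differ.
    """
    neighbors = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {node}, deque([node])
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical example with four prototypes, 0..3.
utility = [(0, 1), (1, 2), (2, 3)]   # utility similarity chains all four together
safety = [(1, 0), (2, 3), (0, 3)]    # safety similarity disagrees on edge (1, 2)
consensus = consensus_edges(utility, safety)
pool_for_1 = pooling_set(1, consensus)
pool_for_2 = pooling_set(2, consensus)
```

In this toy example the utility graph alone would pool all four prototypes, whereas the consensus graph keeps prototypes 1 and 2 apart because their safety profiles are not judged similar. This is the sense in which intersection-based pooling is conservative.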

Paper Structure (47 sections, 16 theorems, 96 equations, 8 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Problem Formulation
Interaction Protocol
Inference-Time Reward and Online Regret
Warm-Start and Offline Suboptimality Gap
Key Challenges
Methodology
Representations
Conservative Consensus Clustering
Prototype-Specific Ridge Statistics
Conservative Pooling via Intersection
Routing and Graph Refinement
Theoretical Analysis
Experiments
...and 32 more sections

Key Result

Theorem 4.4

Suppose the finite-arm, prototype-process, and clustered-linear assumptions (assump:finite_arm, assump:prototype_process, assump:clustered_linear) hold. Then the cumulative regret of CCLUB satisfies a two-term bound, where the first term captures the cost of cluster identification and the second matches the cluster-level regret. The detailed proof is given with Theorem thm:rccb-regret.

Figures (8)

Figure 1: Safety--utility trade-offs as weight $w$ varies.
Figure 2: Adding safety features improves separability (AUC $\uparrow$).
Figure 3: Overview of the CCLUB Framework for Adaptive Social Alignment. The system integrates an offline safety-aware initialization phase with an online consensus clustering mechanism to route prompts dynamically.
Figure 4: Performance comparison on online cumulative reward and offline deployment suboptimality gap.
Figure 5: Average reward vs. inference-time preference weight $w$ under different offline data ratios (warm-start at $w_{\mathrm{train}}=0.5$).
...and 3 more figures

Theorems & Definitions (35)

Theorem 4.4: Regret Bound of CCLUB
Remark 4.5: Warm-start influence
Remark 8.1: Interpreting the identification term
Lemma 8.2: Concat gap implies objective gap
proof
Lemma 8.3: Anytime-Valid Coverage of Prototypes
proof
Lemma 8.4: Empirical Bernstein Bound
Remark 8.5: Unknown $p_{\min}$
Lemma 8.6: Design-matrix eigenvalue growth under uniform exploration
...and 25 more
