VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

Jiawei Chen, Tianzhuo Yang, Guoxi Zhang, Jiaming Ji, Yaodong Yang, Juntao Dai

TL;DR

This work proposes VISA (Value Injection via Shielded Adaptation), a closed-loop framework designed to navigate the alignment tax while staying loyal to the original knowledge, significantly outperforming both standard fine-tuning methods and prompting-based baselines, including GPT-4o.

Abstract

Aligning Large Language Models (LLMs) with nuanced human values remains a critical challenge, as existing methods like Reinforcement Learning from Human Feedback (RLHF) often handle only coarse-grained attributes. In practice, fine-tuning LLMs on task-specific datasets to optimize value alignment inevitably incurs an alignment tax: the model's pre-calibrated value system drifts significantly due to latent bias absorption from training data, while the fine-tuning process also causes severe hallucinations and semantic information loss in generated responses. To address this, we propose VISA (Value Injection via Shielded Adaptation), a closed-loop framework designed to navigate this trade-off. VISA's architecture features a high-precision value detector, a semantic-to-value translator, and a core value-rewriter. The value-rewriter is trained via Group Relative Policy Optimization (GRPO) with a composite reward function that simultaneously optimizes for fine-grained value precision and the preservation of semantic integrity. By learning a policy that balances these competing objectives, VISA effectively mitigates the alignment tax while staying loyal to the original knowledge. Our experiments demonstrate that this approach enables precise control over a model's value expression while maintaining its factual consistency and general capabilities, significantly outperforming both standard fine-tuning methods and prompting-based baselines, including GPT-4o.

Paper Structure (66 sections, 8 equations, 8 figures, 5 tables, 1 algorithm)

Figures (8)

  • Figure 1: The phenomenon of Value Drift (the unintended shift of a model's foundational values after knowledge fine-tuning). When a base model is fine-tuned on a knowledge-centric dataset, its values shift sharply on the four-dimensional scale, demonstrating an undesirable value drift. Value dimensions are scored using a methodology adapted from Zhang et al. (2025).
  • Figure 2: The VISA pipeline is designed to decouple knowledge preservation from value alignment. The User Workflow (top) shows the inference-time process: a user's textual instruction is interpreted by the Translator into a latent Value Shift Vector ($\Delta V$). Concurrently, the Detector analyzes the Original Response to extract its intrinsic Original Value Vector. These vectors are combined to form a precise Target Value. The core Rewriter model then conditions on the original response and this target vector to produce a new, value-aligned output. The Training Process (bottom) details how the Rewriter is optimized using GRPO. It learns to maximize a dual-objective reward signal, combining Value Reward (cosine similarity to the target value vector) and Consistency Reward (semantic entailment with the original response), thereby learning to inject values without hallucinating or losing factual information.
  • Figure 3: Human evaluation on value rewriting quality and precision. (a) Our model outperforms all baselines in pairwise preference comparisons. (b) In terms of value identification consistency, our model achieves the highest average match score (7.60/10) with the lowest variance.
  • Figure 4: Qualitative comparison of VISA and GPT-4o on a value rewriting task. VISA successfully injects the target values while maintaining high knowledge consistency, whereas the prompted GPT-4o achieves lower value cosine consistency and deviates from the original information. Refer to the case study section for a detailed analysis.
  • Figure 5: Performance comparison of alignment methods. (a) Semantic Consistency ($\uparrow$) and (b) Value L2 Distance ($\downarrow$) relative to the target vector $v^*$. (c) Joint Success Rate (JSR) across different model scales. Our method consistently outperforms baselines (SFT, DPO, SimPO) in balancing value injection and semantic preservation.
  • ...and 3 more figures
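
The Figure 2 caption describes a target value formed from the original value vector and the shift $\Delta V$, a reward combining value cosine similarity with a consistency score, and GRPO training. The sketch below illustrates those pieces under several assumptions that are not taken from the paper: the target is formed by simple vector addition, the two rewards are mixed with equal weights (`w_value`, `w_consistency`), the consistency score is supplied as a precomputed float in [0, 1] from some entailment model, and advantages use the usual group-normalized form associated with GRPO. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def target_value(original_value, delta_v):
    """Assumed combination rule: target = original value vector + shift vector."""
    return [o + d for o, d in zip(original_value, delta_v)]


def composite_reward(rewritten_value, target, consistency,
                     w_value=0.5, w_consistency=0.5):
    """Weighted sum of value reward and consistency reward.

    `consistency` stands in for an entailment score between the original
    and rewritten responses; the weights are illustrative assumptions.
    """
    value_reward = cosine_similarity(rewritten_value, target)
    return w_value * value_reward + w_consistency * consistency


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four-dimensional value vectors, matching the scale mentioned in Figure 1.
    original = [0.2, 0.5, 0.1, 0.7]
    delta = [0.3, 0.0, -0.1, 0.0]
    target = target_value(original, delta)

    # Hypothetical group of rewrites: (detected value vector, consistency score).
    group = [
        ([0.5, 0.5, 0.0, 0.7], 0.9),
        ([0.2, 0.5, 0.1, 0.7], 1.0),
        ([0.9, 0.1, 0.0, 0.1], 0.4),
    ]
    rewards = [composite_reward(v, target, c) for v, c in group]
    print(group_relative_advantages(rewards))
```

In this toy group, a rewrite is favored only when it both moves toward the target vector and stays consistent with the original response, which is the trade-off the dual-objective reward is meant to capture.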