Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

Jiajun Chen, Hua Shen

TL;DR

This work introduces Value Alignment Tax (VAT) to quantify how alignment efforts reshape interdependent human values in LLMs, moving beyond static, target-centric evaluations. By modeling value states from context-conditioned judgments and analyzing their co-variation through gain-normalized metrics and coupling matrices, the authors reveal structured, system-level shifts and identify coordination hubs within the Schwartz value circumplex. They develop a sequential, two-stage data construction pipeline and demonstrate across four models and multiple alignment strategies that similar on-target gains can produce divergent alignment taxes and stability profiles. The findings highlight systemic risks and provide a framework for tax-aware alignment, with implications for safer, more controllable deployment of LLMs in normative domains.

Abstract

Existing work on value alignment typically characterizes value relations statically, ignoring how interventions such as prompting, fine-tuning, or preference optimization reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values relative to the achieved on-target gain. VAT captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation; they reveal systemic, process-level alignment risks and offer new insights into the dynamics of value alignment in LLMs.
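
To make the idea of off-target change "relative to achieved on-target gain" concrete, here is a minimal Python sketch. It is not the paper's definition of VAT or nVAT: the function names `mean_shift` and `gain_normalized_tax`, the ratio used, and the toy scores are all assumptions chosen for illustration. It only shows how paired pre-post judgments could be turned into per-value shifts and a single gain-normalized number.

```python
from statistics import mean


def mean_shift(pre, post):
    """Per-value mean change across paired pre/post judgment scores.

    `pre` and `post` map a value name to a list of scores, one per
    scenario, in the same scenario order (an assumed data layout).
    """
    return {v: mean(b - a for a, b in zip(pre[v], post[v])) for v in pre}


def gain_normalized_tax(pre, post, target, eps=1e-9):
    """Illustrative ratio: total |off-target shift| / |on-target shift|.

    This is a hypothetical stand-in for a gain-normalized metric,
    not the formula used in the paper.
    """
    shifts = mean_shift(pre, post)
    gain = abs(shifts[target])
    off_target = sum(abs(d) for v, d in shifts.items() if v != target)
    return off_target / max(gain, eps)


# Made-up scores for three Schwartz values over two scenarios,
# before and after an intervention that suppresses Power.
pre = {
    "Power": [0.8, 0.6],
    "Achievement": [0.5, 0.5],
    "Benevolence": [0.5, 0.3],
}
post = {
    "Power": [0.4, 0.2],
    "Achievement": [0.4, 0.4],
    "Benevolence": [0.6, 0.4],
}

if __name__ == "__main__":
    print(mean_shift(pre, post))
    print(gain_normalized_tax(pre, post, target="Power"))
```

Under this toy definition, two interventions with the same shift on the target value can yield different ratios if one of them moves the remaining values more, which is the kind of difference a target-only evaluation would not show.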

Paper Structure

This paper contains 67 sections, 12 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Illustration of Value Alignment Tax. Traditional trait-level evaluation reports independent value scores, whereas VAT elicits state-level value configurations and models values as a relational system, revealing alignment-induced trade-offs. Edge direction denotes influence; width indicates trade-off magnitude.
  • Figure 2: Value-level alignment coupling under different steering objectives. Top row: Normalized VAT$(v)$/nVAT profiles (radar plots) showing value participation strength under each steering objective. Bottom row: Corresponding value--value coupling structures (chord diagrams; top-$|R_{uv}|$ edges, 8-shot). Red indicates strong positive coupling; blue indicates strong negative coupling.
  • Figure 3: Trade-off between target value gain and system-level alignment tax (nVAT) across SFT and DPO checkpoints when suppressing Power. Dashed lines indicate Pareto-efficient alignment regimes.
  • Figure 4: Value-level alignment tax projected onto the Schwartz circumplex. Colors denote steered values, line styles indicate alignment strength, and node opacity reflects stability across shots.
  • Figure 5: Alignment-induced risk amplification. Distribution of value-level amplification (VAT$(v)$) for coordination hubs (high-VAT values) and non-hubs under different steering objectives (GPT-4o, 8-shot).
  • ...and 8 more figures