Value Drifts: Tracing Value Alignment During LLM Post-Training

Mehar Bhatia, Shravan Nayak, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Vered Shwartz, Siva Reddy

TL;DR

This work investigates how LLMs acquire and preserve human-value alignment during post-training by tracing value drifts across SFT and subsequent preference optimization. It introduces V-PRISM to measure stance-based values and demonstrates that SFT largely sets the model's value priors, while standard preference optimization induces minimal drift; however, a synthetic dataset with a controlled value gap reveals that the choice of optimization algorithm can reshape values. The findings highlight the importance of data curation and algorithm selection in post-training pipelines and offer actionable guidance for improving alignment to human values. The study also discusses ethical considerations, limitations of stance-based proxies, and the need for broader data coverage to avoid culturally narrow conclusions.

Abstract

As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.

Paper Structure

This paper contains 71 sections, 1 equation, 17 figures, 17 tables.

Figures (17)

  • Figure 1: Post-training can cause value drift, shifting the stance of model generations from neutral to support when the model is asked a value-probing question such as "Should we close the gates and stop immigration?" In this paper, we analyze how post-training reshapes these values.
  • Figure 2: SFT-induced values for Llama-3-3B and Qwen-3-4B models trained on WildChat and Alpaca for the topic of immigration. Each line represents the mean stance probability of support, neutral, and oppose stances, with 95% confidence intervals. In all cases, SFT leads to changes in stance distribution, often very early in training; WildChat leads to a high proportion of neutral responses, while Alpaca leads to a higher proportion of responses supporting immigration.
  • Figure 3: Values on the topic of abortion induced by training Llama3-3B-SFT-WildChat on UltraFeedback. Each line represents the mean stance probability of support, neutral, and oppose stances, with 95% confidence intervals. Across PPO, DPO, and SimPO, stance distributions remain stable after SFT, suggesting that preference optimization leads to little or no value drift.
  • Figure 4: Value drifts induced by different preference optimization algorithms. Each line represents the mean stance probability of support, neutral, and oppose stances, with 95% confidence intervals.
  • Figure 5: Prompt used to elicit stance distribution for each generated response.
  • ...and 12 more figures
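
Several captions above report a "mean stance probability" for the support, neutral, and oppose stances with 95% confidence intervals. The following is a minimal sketch of how such a summary could be computed for one checkpoint. It assumes each prompt has several sampled responses that have already been labeled with a stance; the function names, the input layout, and the normal-approximation interval are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

# The three stance labels named in the figure captions.
STANCES = ("support", "neutral", "oppose")


def stance_distribution(labels):
    """Fraction of one prompt's sampled responses that fall in each stance."""
    counts = Counter(labels)
    n = len(labels)
    return {s: counts[s] / n for s in STANCES}


def mean_with_ci(values, z=1.96):
    """Mean and half-width of a normal-approximation 95% interval (assumed method)."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)


def summarize(per_prompt_labels):
    """Average the per-prompt stance distributions across prompts."""
    dists = [stance_distribution(labels) for labels in per_prompt_labels]
    return {s: mean_with_ci([d[s] for d in dists]) for s in STANCES}


# Hypothetical labels for two prompts at a single training checkpoint.
example = [
    ["support", "support", "neutral", "oppose"],
    ["support", "neutral", "neutral", "neutral"],
]
summary = summarize(example)
for stance, (mean, half_width) in summary.items():
    print(f"{stance}: {mean:.3f} +/- {half_width:.3f}")
```

Repeating this summary at each saved checkpoint would give one point per stance per training step, which is the kind of curve the captions describe.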