Value Alignment from Unstructured Text

Inkit Padhi; Karthikeyan Natesan Ramamurthy; Prasanna Sattigeri; Manish Nagireddy; Pierre Dognin; Kush R. Varshney

TL;DR

The paper addresses value alignment for LLMs when the target values are embedded in unstructured text. It proposes an end-to-end pipeline in which a large teacher model extracts value signals from documents and generates synthetic instruction data ($\mathcal{D}_{\text{sft}}$) and synthetic preference data ($\mathcal{D}_{\text{pref}}$); the base model is then trained with supervised fine-tuning (SFT) followed by direct preference optimization (DPO). The approach is demonstrated on two use-cases (BCG and UDHR) with Mistral-7B-Instruct as the base model. A model fine-tuned on $\mathcal{D}_{\text{sft}}$ and then optimized on $\mathcal{D}_{\text{pref}}$ via DPO outperforms baselines across multiple metrics, while RAG may reduce performance in this setting. The method removes the reliance on costly human curation and adapts to different value documents, offering a scalable path toward domain-specific value alignment for LLMs.
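
As a minimal sketch of the preference-optimization step named above, the standard DPO objective for a single preference pair can be computed as shown here. This is the general DPO formulation, not code from the paper: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) preference pair.

    All names and the beta default are hypothetical; each *_logp is the
    log-probability of a full response given the prompt.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of the two margins.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss equals log 2 when the policy and the reference agree, and it falls below that value as the policy shifts probability toward the chosen response relative to the reference.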

Abstract

Aligning large language models (LLMs) to value systems has emerged as a significant area of research within the fields of AI and NLP. Currently, this alignment process relies on the availability of high-quality supervised and preference data, which can be both time-consuming and expensive to curate or annotate. In this paper, we introduce a systematic end-to-end methodology for aligning LLMs to the implicit and explicit values represented in unstructured text data. Our proposed approach leverages the use of scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data. Through two distinct use-cases, we demonstrate the efficiency of our methodology on the Mistral-7B-Instruct model. Our approach credibly aligns LLMs to the values embedded within documents, and shows improved performance against other approaches, as quantified through the use of automatic metrics and win rates.

Paper Structure (15 sections, 2 equations, 8 figures, 4 tables)

Introduction
Alignment from Unsupervised Data
Synthetic Data Generation
Algorithms
End-to-end Pipeline
Experimental Setup
Use Cases
Methods
Evaluation
Experimental Results and Discussion
Conclusion
Prompt Templates for Synthetic Data Generation
Qualitative Analysis
Training Details
Win Rates Comparison

Figures (8)

Figure 1: End-to-end view: Our alignment method involves instruction and scenario SDG steps, whose outputs are then used for SFT and preference optimization.
Figure 2: Instruct SDG, $\mathcal{D}_{\text{sft}}$: Synthetic data generation pipeline for creating instruction data.
Figure 3: Preference SDG, $\mathcal{D}_{\text{pref}}$: Synthetic data generation pipeline for creating synthetic scenario or preference data.
Figure 4: Prompt template for question generation as used in $\mathcal{D}_{\text{sft}}$ pipeline.
Figure 5: Prompt template for answer generation as used in $\mathcal{D}_{\text{sft}}$ pipeline.
...and 3 more figures
