Decoding Human Preferences in Alignment: An Improved Approach to Inverse Constitutional AI

Carl-Leander Henneking, Claas Beger

TL;DR

This work targets the interpretability gap in traditional LLM alignment by advancing Inverse Constitutional AI (ICAI) to extract clearer, more general constitutional principles from pairwise preference datasets. It introduces improvements in principle generation, clustering, and embedding strategies, including centroid-based subsampling and multi-embedding difference maps, to enhance generalizability across synthetic, semi-synthetic, and real data. Empirically, the enhanced ICAI variants improve preference regeneration and constitutional similarity, with notable gains in synthetic and semi-synthetic settings, while real-data performance remains mixed and calls for further investigation. The study also shows that incorporating preference scores into prompting can boost extraction quality, underscoring the potential of constitutions as transparent artifacts for steering alignment, albeit with scalability and evaluation challenges that future work must address.

Abstract

Traditional methods for aligning Large Language Models (LLMs), such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on implicit principles, limiting interpretability. Constitutional AI (CAI) offers an explicit, rule-based framework for guiding LLM alignment. Building on this, we refine the Inverse Constitutional AI (ICAI) algorithm, which extracts constitutions from preference datasets. By improving principle generation, clustering, and embedding processes, our approach enhances the accuracy and generalizability of extracted principles across synthetic and real-world datasets. Our results highlight the potential of these principles to foster more transparent and adaptable alignment methods, offering a promising direction for future advancements beyond traditional fine-tuning.

Paper Structure

This paper contains 11 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: We apply KMeans to the differences between the embeddings of each preference pair. After estimating cluster potential, we extract representative node triplets, which are inserted into a single joint principle-generation prompt. Prompts are executed separately for each dimension.
  • Figure 2: Dataset generation process for the synthetic dataset. Colors in the final dataset represent the samples that were generated using the same principle.
  • Figure 3: Dataset generation process for the semi-synthetic dataset. Different shades of purple in the final dataset indicate different deltas between chosen and rejected ratings.
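
The clustering step described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D "embeddings", the function names (`difference_vectors`, `kmeans`, `representative_triplets`), and the farthest-point initialisation are all assumptions made for illustration, and the "cluster potential" estimate mentioned in the caption is omitted because the text here does not specify it. The sketch only shows k-means on chosen-minus-rejected embedding differences, followed by picking the three pairs closest to each centroid.

```python
def difference_vectors(chosen, rejected):
    """Per-pair embedding difference: chosen minus rejected."""
    return [[c - r for c, r in zip(cv, rv)] for cv, rv in zip(chosen, rejected)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with a deterministic farthest-point init
    (an assumption of this sketch, chosen so the example is reproducible)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:  # keep the old centroid if a cluster empties out
                updated.append(centroids[j])
                continue
            dim = len(members[0])
            updated.append([sum(m[d] for m in members) / len(members)
                            for d in range(dim)])
        if updated == centroids:  # converged
            break
        centroids = updated
    return assign(points, centroids), centroids


def representative_triplets(points, labels, centroids, n=3):
    """For each cluster, indices of the n pairs closest to its centroid."""
    triplets = {}
    for j, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == j]
        members.sort(key=lambda i: sq_dist(points[i], centroid))
        triplets[j] = members[:n]
    return triplets


# Toy data (hypothetical): 8 preference pairs with 2-D embeddings.
chosen = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [1.0, 0.2],
          [0.0, 1.0], [0.1, 1.1], [-0.1, 0.9], [0.2, 1.0]]
rejected = [[0.0, 0.0] for _ in chosen]

diffs = difference_vectors(chosen, rejected)
labels, centroids = kmeans(diffs, k=2)
triplets = representative_triplets(diffs, labels, centroids)
print(labels)
print(triplets)
```

In a pipeline like the one the caption describes, the pair indices in each triplet would be used to look up the corresponding preference pairs and place them together in one principle-generation prompt; that prompting step is outside the scope of this sketch.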