C3AI: Crafting and Evaluating Constitutions for Constitutional AI

Yara Kyrychenko, Ke Zhou, Edyta Bogucka, Daniele Quercia

TL;DR

The paper introduces the C3AI framework for crafting and evaluating CAI constitutions. The framework has two parts: (i) assembling a constitution from a large item pool via item selection, item transformation, and three principle selection approaches, and (ii) evaluating model adherence through principle-specific and use-specific tests. The paper shows that positively framed, behavior-based principles align more closely with human preferences than negatively framed or trait-based ones, and uses Exploratory Graph Analysis (EGA) to identify six latent principle factors. In a case study, these factors yield a compact constitution that improves safety while preserving general capabilities. The framework supports automated or synthetic evaluation of principles, reducing reliance on extensive human preference data, and shows that ORPO fine-tuning with a refined principle set can enhance safety without harming reasoning. The work also discusses limitations and applicability to governance and compliance tasks, and offers open-source access for adaptation to other use cases.

Abstract

Constitutional AI (CAI) guides LLM behavior using constitutions, but identifying which principles are most effective for model alignment remains an open challenge. We introduce the C3AI framework (Crafting Constitutions for CAI models), which serves two key functions: (1) selecting and structuring principles to form effective constitutions before fine-tuning; and (2) evaluating whether fine-tuned CAI models follow these principles in practice. By analyzing principles from AI and psychology, we found that positively framed, behavior-based principles align more closely with human preferences than negatively framed or trait-based principles. In a safety alignment use case, we applied a graph-based principle selection method to refine an existing CAI constitution, improving safety measures while maintaining strong general reasoning capabilities. Interestingly, fine-tuned CAI models performed well on negatively framed principles but struggled with positively framed ones, in contrast to our human alignment results. This highlights a potential gap between principle design and model adherence. Overall, C3AI provides a structured and scalable approach to both crafting and evaluating CAI constitutions.

Paper Structure

This paper contains 25 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The C3AI framework serves two key functions: (1) crafting constitutions and (2) evaluating whether models adhere to their constitutions. Crafting involves three steps: selecting relevant items for a specific use case (Item Selection), converting them into standardized, human-understandable statements and machine-readable principles (Item Transformation), and curating a final set of principles to form a constitution (Principle Selection). Evaluating model adherence assesses how well the model follows specific principles (principle-specific evaluation) and whether it aligns with intended uses by, for example, effectively supporting safety or mathematical reasoning (use-specific evaluation).
  • Figure 2: EGA graph in which nodes represent 185 principles and edges are weighted by the correlation between principle pairs. Thicker edges indicate larger absolute correlations; solid edges represent positive correlations, while dashed edges indicate negative correlations. These correlations are derived from 1,800 conversations spanning three conversational objectives, which aim to ensure that conversations are harmless, helpful, and effective in general-purpose contexts. The graph shown is the median graph from 500 bootstrapped EGA runs; nodes removed during Unique Variable Analysis (UVA) are omitted. The six principle factors are reported along with their dataset sources. Nodes of the same color belong to the same factor, while the three distinct node shapes mark the principles with the highest principle-objective alignment, that is, those best suited to each of the three conversational objectives. Node size reflects the overall strength of the node's connections within the graph.
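
The graph encoding described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the paper's implementation: the principle names and scores are invented toy data, the helper names (`pearson`, `correlation_graph`, `node_strength`) are hypothetical, and plain Pearson correlation is used for simplicity. The bootstrapping, UVA, and factor-detection steps of EGA are not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_graph(scores):
    """Build one edge per principle pair, weighted by their correlation.

    `scores` maps a principle name to its per-conversation scores.
    The line style mirrors the caption: solid for positive correlations,
    dashed for negative ones.
    """
    edges = {}
    for p, q in combinations(sorted(scores), 2):
        r = pearson(scores[p], scores[q])
        edges[(p, q)] = {"weight": r, "style": "solid" if r > 0 else "dashed"}
    return edges


def node_strength(edges):
    """Sum of absolute edge weights per node (drawn as node size)."""
    strength = {}
    for (p, q), edge in edges.items():
        for node in (p, q):
            strength[node] = strength.get(node, 0.0) + abs(edge["weight"])
    return strength


# Toy data (hypothetical): three principles scored on four conversations.
scores = {
    "be_helpful": [1, 2, 3, 4],
    "be_honest": [2, 4, 6, 8],
    "avoid_harm": [4, 3, 2, 1],
}
edges = correlation_graph(scores)
strength = node_strength(edges)
```

In this toy data, `be_helpful` and `be_honest` rise together and would be joined by a solid edge, while `avoid_harm` falls as the other two rise and would be joined to each by a dashed edge.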