C3AI: Crafting and Evaluating Constitutions for Constitutional AI
Yara Kyrychenko, Ke Zhou, Edyta Bogucka, Daniele Quercia
TL;DR
The paper introduces C3AI, a framework for crafting and evaluating constitutions for Constitutional AI (CAI). The framework has two parts: (i) crafting a constitution from a large item pool through item selection, item transformation, and three principle-selection approaches, and (ii) evaluating model adherence through principle-specific and use-specific tests. The paper shows that positively framed, behavior-based principles align more closely with human preferences than negatively framed or trait-based ones. It uses Exploratory Graph Analysis to identify six latent principle factors, which yields a compact constitution that, in a safety case study, improves safety while preserving general capabilities. The framework supports automated or synthetic evaluation of principles, reducing reliance on extensive human preference data, and the case study shows that ORPO fine-tuning with a refined principle set can enhance safety without harming reasoning. The paper also discusses limitations and applicability to governance and compliance tasks, and the framework is released as open source so it can be adapted to other use cases.
Abstract
Constitutional AI (CAI) guides LLM behavior using constitutions, but identifying which principles are most effective for model alignment remains an open challenge. We introduce the C3AI framework (Crafting Constitutions for CAI models), which serves two key functions: (1) selecting and structuring principles to form effective constitutions before fine-tuning; and (2) evaluating whether fine-tuned CAI models follow these principles in practice. By analyzing principles from AI and psychology, we found that positively framed, behavior-based principles align more closely with human preferences than negatively framed or trait-based principles. In a safety alignment use case, we applied a graph-based principle selection method to refine an existing CAI constitution, improving safety measures while maintaining strong general reasoning capabilities. Interestingly, fine-tuned CAI models performed well on negatively framed principles but struggled with positively framed ones, in contrast to our human alignment results. This highlights a potential gap between principle design and model adherence. Overall, C3AI provides a structured and scalable approach to both crafting and evaluating CAI constitutions.
