Inverse Constitutional AI: Compressing Preferences into Principles

Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins

TL;DR

This paper reframes the interpretation of pairwise preference data as an inverse compression problem, introducing Inverse Constitutional AI (ICAI). It defines a first algorithm that generates, clusters, tests, and filters natural-language principles (a constitution) to enable an LLM to reconstruct the original annotations, producing a compact and interpretable representation of preferences. Through experiments on synthetic data, AlpacaEval, Chatbot Arena, and PRISM, ICAI demonstrates reconstruction fidelity, bias discovery, and potential for personalized and group-specific constitutions, as well as annotation scaling. The work emphasizes interpretability, transferability across models, and practical use cases, while acknowledging limitations such as non-uniqueness and lossy compression, and releases code to support reproducibility.

Abstract

Feedback data is widely used for fine-tuning and evaluating state-of-the-art AI models. Pairwise text preferences, where human or AI annotators select the "better" of two options, are particularly common. Such preferences are used to train (reward) models or to rank models with aggregate statistics. For many applications it is desirable to understand annotator preferences in addition to modelling them - not least because extensive prior work has shown various unintended biases in preference datasets. Yet, preference datasets remain challenging to interpret. Neither black-box reward models nor statistics can answer why one text is preferred over another. Manual interpretation of the numerous (long) response pairs is usually equally infeasible. In this paper, we introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task. In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models. ICAI inverts this process: given a feedback dataset, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding ICAI algorithm and validate its generated constitutions quantitatively based on annotation reconstruction accuracy on several datasets: (a) synthetic feedback data with known principles; (b) AlpacaEval cross-annotated human feedback data; (c) crowdsourced Chatbot Arena data; and (d) PRISM data from diverse demographic groups. As a short and interpretable representation of the original dataset, generated constitutions have many potential use cases: help identify undesirable annotator biases, understand model performance better, scale feedback to unseen data, or adapt models to individual user or group preferences. We release the source code at https://github.com/rdnfn/icai.

Paper Structure (63 sections, 1 equation, 10 figures, 10 tables)

Figures (10)

  • Figure 1: The Inverse Constitutional AI problem. Starting with pairwise preference feedback data, we derive a set of natural language principles (a constitution) that explain the preferences. For validation, we reconstruct the original preferences with an LLM judging according to the generated constitution. The constitution represents a (highly compact) compression of the preferences.
  • Figure 2: Overview of our Inverse Constitutional AI (ICAI) algorithm. Given a dataset of pairwise comparisons, in Step 1 candidate principles are generated using an LLM. In Step 2, these principles are clustered using an embedding model. In Step 3, similar principles are deduplicated by sampling one principle per cluster. In Step 4, each principle is tested to evaluate its ability to help an LLM reconstruct the original annotations. Finally, in Step 5, the principles are filtered according to the testing results, and the filtered set of principles is returned as the final constitution. Optionally, a final step of additional clustering and subsampling can follow to ensure diverse principles.
  • Figure 3: Results on synthetic data. Our constitutional annotators can reconstruct a variety of preferences using limited data and without fine-tuning. We demonstrate our algorithm's adaptability on three synthetic datasets: one orthogonal to the base LLM's learned preferences, one aligned with those preferences and one unaligned with them. We generate constitutions for each and report agreement with the original data of a default LLM annotator and a constitutional annotator (prompted with a constitution). Our constitutions notably improve agreement in the orthogonal and unaligned cases and retain high agreement in the aligned case, albeit with more variance. Our method's ability to detect biases is illustrated by the example constitution in the unaligned case. Plots show mean and standard deviation (6 seeds) using GPT-3.5-Turbo.
  • Figure 4: Results on AlpacaEval data. GPT-4o generates and uses interpretable constitutions that match the performance of the default annotator on aligned preferences and notably increase agreement with unaligned preferences. Tested on aligned (original) and unaligned (flipped) versions of AlpacaEval, with GPT-4o generating constitutions which are then used by constitutional annotators backed by GPT-4o and GPT-3.5-Turbo. Note we can only expect significant improvement in the unaligned case, as discussed in the main text. The aligned case does not leave room for improvement over the default annotator, but allows us to gain new insights into the preferences expressed in the dataset. In the unaligned case, GPT-4o's agreement improves notably, while GPT-3.5-Turbo's performance does not exceed random choice, indicating its limited ability to follow unaligned principles. Plots show mean and standard deviation (6 seeds).
  • Figure 5: Case study: Constitutions for demographic groups on PRISM data. We consider two groups reported by Kirk et al. (2024) to have preferences differing from average: participants born in one geographical region rank Mistral-7b higher in this dataset (Group A), and those born in another region rank Llama-2-7b lower than average (Group B). We generate constitutions for both groups to explore these preferences. For each group, the annotator using the group's data performs best. Constitutions (see the appendix on PRISM constitutions) suggest that Group A prefers Mistral-7b due to its conciseness, while Group B's constitutions have recurring rules related to providing more detailed descriptions. Plots show mean and standard deviation (6 seeds) using GPT-4o.
  • ...and 5 more figures
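
The five steps in the Figure 2 caption (generate, cluster, deduplicate, test, filter) can be illustrated with a small Python sketch. This is not the authors' implementation: the LLM and the embedding model are replaced by deterministic toy functions (`toy_propose`, `toy_cluster_key`, `toy_judge`), and every name, the toy data, and the "more correct than incorrect votes" filter rule are assumptions chosen only to make the control flow concrete.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pair:
    text_a: str
    text_b: str
    preferred: str  # "a" or "b", the original annotation


def toy_propose(pair: Pair) -> list[str]:
    """Step 1 stand-in: an LLM would propose principles explaining the choice."""
    win, lose = (pair.text_a, pair.text_b) if pair.preferred == "a" else (pair.text_b, pair.text_a)
    out = []
    if len(win) < len(lose):
        out.append("Select the shorter response.")
    if "please" in win.lower() and "please" not in lose.lower():
        out.append("Select the more polite response.")
    return out


def toy_cluster_key(principle: str) -> str:
    """Step 2 stand-in: an embedding model would group similar principles."""
    return principle.lower().rstrip(".")


def toy_judge(principle: str, pair: Pair) -> Optional[str]:
    """Steps 4 and later stand-in: an LLM votes "a"/"b", or None if not relevant."""
    text = principle.lower()
    if "shorter" in text:
        if len(pair.text_a) == len(pair.text_b):
            return None
        return "a" if len(pair.text_a) < len(pair.text_b) else "b"
    if "polite" in text:
        a_has, b_has = "please" in pair.text_a.lower(), "please" in pair.text_b.lower()
        if a_has == b_has:
            return None
        return "a" if a_has else "b"
    return None


def extract_constitution(pairs: list[Pair], max_principles: int = 3) -> list[str]:
    # Step 1: generate candidate principles from every pair.
    candidates = [p for pair in pairs for p in toy_propose(pair)]
    # Step 2: cluster candidates.
    clusters = defaultdict(list)
    for cand in candidates:
        clusters[toy_cluster_key(cand)].append(cand)
    # Step 3: deduplicate (first member here; the caption describes sampling one).
    deduped = [members[0] for members in clusters.values()]
    # Step 4: test each principle against the original annotations.
    scored = []
    for principle in deduped:
        votes = [(toy_judge(principle, pair), pair.preferred) for pair in pairs]
        correct = sum(v == truth for v, truth in votes if v is not None)
        wrong = sum(v != truth for v, truth in votes if v is not None)
        scored.append((principle, correct, wrong))
    # Step 5: filter (assumed rule: keep principles with more correct than wrong votes).
    kept = sorted((s for s in scored if s[1] > s[2]), key=lambda s: s[2] - s[1])
    return [principle for principle, _, _ in kept[:max_principles]]


def agreement(constitution: list[str], pairs: list[Pair]) -> float:
    """Fraction of original annotations reconstructed using the constitution."""
    hits = 0
    for pair in pairs:
        votes = (toy_judge(p, pair) for p in constitution)
        choice = next((v for v in votes if v is not None), None)
        hits += choice == pair.preferred
    return hits / len(pairs)


# Toy dataset in which the annotator always picks the shorter response.
pairs = [
    Pair("Paris.", "The capital of France is the city of Paris.", "a"),
    Pair("It is roughly 384,400 km away on average.", "About 384,400 km.", "b"),
    Pair("Please see the manual for the full steps.", "See the manual.", "b"),
    Pair("Yes.", "Yes, please go ahead and do that.", "a"),
    Pair("Please wait.", "Wait a moment while I check this.", "a"),
]
constitution = extract_constitution(pairs)
score = agreement(constitution, pairs)
print(constitution, score)
```

In this toy run the "polite" principle is proposed from one pair but contradicts two others during testing, so the filter drops it; this mirrors, in miniature, how testing against the full dataset separates principles that explain the annotations from ones that only fit a single example.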