IterAlign: Iterative Constitutional Alignment of Large Language Models

Xiusi Chen, Hongzhi Wen, Sreyashi Nag, Chen Luo, Qingyu Yin, Ruirui Li, Zheng Li, Wei Wang

TL;DR

IterAlign is a data-driven constitution discovery and self-alignment framework that improves LLM truthfulness, helpfulness, harmlessness, and honesty, with gains of up to 13.5% in harmlessness.

Abstract

With the rapid development of large language models (LLMs), aligning LLMs with human values and societal norms to ensure their reliability and safety has become crucial. Reinforcement learning with human feedback (RLHF) and Constitutional AI (CAI) have been proposed for LLM alignment. However, these methods require either heavy human annotations or explicitly pre-defined constitutions, which are labor-intensive and resource-consuming. To overcome these drawbacks, we study constitution-based LLM alignment and propose a data-driven constitution discovery and self-alignment framework called IterAlign. IterAlign leverages red teaming to unveil the weaknesses of an LLM and automatically discovers new constitutions using a stronger LLM. These constitutions are then used to guide self-correction of the base LLM. Such a constitution discovery pipeline can be run iteratively and automatically to discover new constitutions that specifically target the alignment gaps in the current LLM. Empirical results on several safety benchmark datasets and multiple base LLMs show that IterAlign successfully improves truthfulness, helpfulness, harmlessness and honesty, improving the LLM alignment by up to $13.5\%$ in harmlessness.

Paper Structure

This paper contains 25 sections, 1 equation, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Framework overview for IterAlign. IterAlign begins with red teaming the base LLM to test and collect responses, followed by evaluation using an oracle model to identify improper responses. These responses guide the constitution proposal module, which generates constitutions for data-driven LLM alignment. Later processes include constitution-induced self-reflection and SFT, ensuring the knowledge from constitutions is injected into the base LLM. IterAlign operates iteratively, continually identifying new challenging instances and refining the model to cover a broad spectrum of ethical standards.
  • Figure 2: (a, b): TruthfulQA Generation task evaluation results. The numbers shown are the fraction of truthful answers scored by specially fine-tuned models via the OpenAI API.
  • Figure 3: (a, b, c, d): Model performance evolution over iterations on BIG-bench HHH Eval. The numbers shown are for Vicuna-7B with Anthropic hh-rlhf. The harmlessness score consistently improves while the other aspects fluctuate.
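
The loop described in the Figure 1 caption (red teaming, oracle evaluation, constitution proposal, constitution-induced self-reflection, SFT, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`run_iteralign`, `IterAlignState`, and the `generate`, `is_improper`, `propose`, `revise`, `finetune` callables) is hypothetical, and the stubs in the demo stand in for real model calls and training. Passing only the current iteration's proposed rules to the revision step is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (prompt, response)


@dataclass
class IterAlignState:
    """Accumulated outputs of the loop (hypothetical container)."""
    constitutions: List[str] = field(default_factory=list)
    sft_pairs: List[Pair] = field(default_factory=list)


def run_iteralign(
    batches: List[List[str]],                        # red-teaming prompts, one batch per iteration
    generate: Callable[[str], str],                  # base LLM
    is_improper: Callable[[str, str], bool],         # oracle model judgment
    propose: Callable[[List[Pair]], List[str]],      # stronger LLM proposing constitutions
    revise: Callable[[str, str, List[str]], str],    # constitution-guided self-reflection
    finetune: Callable[[Callable[[str], str], List[Pair]], Callable[[str], str]],  # SFT step
) -> IterAlignState:
    state = IterAlignState()
    for batch in batches:
        # 1. Red teaming: collect the base model's responses.
        responses = [(p, generate(p)) for p in batch]
        # 2. Oracle evaluation: keep only the improper responses.
        failures = [(p, r) for p, r in responses if is_improper(p, r)]
        if not failures:
            continue
        # 3. Constitution proposal from the observed failures.
        rules = propose(failures)
        for rule in rules:
            if rule not in state.constitutions:
                state.constitutions.append(rule)
        # 4. Self-reflection: revise each failure under the proposed rules.
        pairs = [(p, revise(p, r, rules)) for p, r in failures]
        state.sft_pairs.extend(pairs)
        # 5. SFT: the updated model is the one red-teamed next iteration.
        generate = finetune(generate, pairs)
    return state


# --- Demo with toy stubs (assumptions, not real models) ---
def toy_generate(prompt: str) -> str:
    return "UNSAFE" if "bad" in prompt else "ok"


def toy_is_improper(prompt: str, response: str) -> bool:
    return response == "UNSAFE"


def toy_propose(failures: List[Pair]) -> List[str]:
    return ["Refuse harmful requests."]


def toy_revise(prompt: str, response: str, rules: List[str]) -> str:
    return "I can't help with that."


def toy_finetune(model: Callable[[str], str], pairs: List[Pair]) -> Callable[[str], str]:
    memorized = dict(pairs)  # "training" here is just memorizing the revised answers
    return lambda prompt: memorized.get(prompt, model(prompt))


state = run_iteralign(
    [["bad q1", "fine q"], ["bad q1", "bad q2"]],
    toy_generate, toy_is_improper, toy_propose, toy_revise, toy_finetune,
)
```

In the demo, the second batch repeats `"bad q1"`; because the toy fine-tuning step has already absorbed its revised answer, only `"bad q2"` is flagged in the second iteration, which mirrors the idea that each round targets the gaps remaining in the current model.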