
Aligners: Decoupling LLMs and Alignment

Lilian Ngweta, Mayank Agarwal, Subha Maity, Alex Gittens, Yuekai Sun, Mikhail Yurochkin

TL;DR

This work proposes to decouple LLMs and alignment by training aligner models that can align any LLM to a given criterion on an as-needed basis, which also reduces the potential negative impact of alignment on performance.

Abstract

Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications. Alignment is challenging, costly, and needs to be repeated for every LLM and alignment criterion. We propose to decouple LLMs and alignment by training aligner models that can be used to align any LLM for a given criterion on an as-needed basis, thus also reducing the potential negative impacts of alignment on performance. Our recipe for training the aligner models relies solely on synthetic data generated with a (prompted) LLM and can be easily adjusted for a variety of alignment criteria. We use the same synthetic data to train inspectors, binary misalignment classification models that guide a "squad" of multiple aligners. Our empirical results demonstrate consistent improvements when applying an aligner squad to various LLMs, including chat-aligned models, across several instruction-following and red-teaming datasets.
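The abstract does not spell out how inspectors guide the squad. As a minimal sketch of one plausible reading, and not the authors' implementation, the loop below assumes each inspector returns a misalignment score in [0, 1] and that the aligner paired with the highest-scoring inspector is applied next. The names `apply_squad`, `threshold`, `max_rounds`, and the toy inspector and aligner are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical interfaces (assumptions, not from the paper):
# an aligner rewrites a response given the prompt,
# an inspector scores how misaligned a response is, in [0, 1].
Aligner = Callable[[str, str], str]
Inspector = Callable[[str, str], float]


def apply_squad(
    prompt: str,
    response: str,
    aligners: Dict[str, Aligner],
    inspectors: Dict[str, Inspector],
    threshold: float = 0.5,   # assumed cutoff for "misaligned"
    max_rounds: int = 3,      # assumed cap on rewrite rounds
) -> str:
    """Apply the aligner whose inspector reports the highest score,
    until every score is below the threshold or rounds run out."""
    for _ in range(max_rounds):
        scores = {name: inspect(prompt, response)
                  for name, inspect in inspectors.items()}
        worst = max(scores, key=scores.get)
        if scores[worst] < threshold:
            break  # all inspectors consider the response aligned
        response = aligners[worst](prompt, response)
    return response


# Toy stand-ins for trained models, for illustration only.
def toy_inspector(prompt: str, response: str) -> float:
    return 1.0 if "rude" in response else 0.0


def toy_aligner(prompt: str, response: str) -> str:
    return response.replace("rude", "polite")


result = apply_squad(
    "hi", "a rude reply",
    aligners={"tone": toy_aligner},
    inspectors={"tone": toy_inspector},
)
```

In this sketch the base LLM is never touched: only its response is passed through the squad, which is the decoupling the abstract describes.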

Paper Structure (31 sections, 3 equations, 6 figures, 6 tables)


Figures (6)

  • Figure 1: Our proposed work pipeline where we start by generating synthetic data that we use to train inspectors and aligners. We then use trained aligners and inspectors to align responses from existing LLMs. Finally, we evaluate aligned responses using popular evaluators such as GPT-4 via AlpacaEval 2.0 [alpaca_eval] and PairRM [llm-blender-2023].
  • Figure 2: Trajectories of inspector scores when the styles are independent of each other (left), and when aligning with one style improves (middle) or harms (right) the other style. In the first two cases the desired alignment is achieved, whereas in the last case the alignment is not achieved by the aligner squad.
  • Figure 3: Phi-2 aligners squad results on all 14 harm categories of the BeaverTails-Evaluation dataset, where the base responses aligned by the Phi-2 aligners squad were generated by Llama-2-13B. Our aligners squad does well on the categories that are relevant to our aligner types (first four), but is less effective on the others. The flexibility of our pipeline allows training aligners for other categories if desired.
  • Figure 4: Plots showing the effect of applying the Phi-2 aligners squad on base responses from Llama-2-70B. The application of the first aligner significantly improves the other alignment scores.
  • Figure 5: Examples of RedPajama-3B ethical aligner's responses.
  • ...and 1 more figure