Aligning Language Models with Demonstrated Feedback

Omar Shaikh; Michelle S. Lam; Joey Hejna; Yijia Shao; Hyundong Cho; Michael S. Bernstein; Diyi Yang

TL;DR

DITTO presents a data-efficient, demonstration-driven approach to personalize LLM alignment to individual users or tasks by turning a handful of user-provided demonstrations into online comparison data. The method frames alignment as a KL-constrained, online imitation learning problem and updates via a preference optimization objective (e.g., DPO), leveraging inter-model and replay comparisons to improve robustness. Across static author-writing benchmarks and a user study with real demonstrations, DITTO outperforms few-shot prompting, SFT, and self-play methods by substantial margins, while also showing favorable sample efficiency. The work highlights a practical pathway for rapid, user-specific customization of LLMs and motivates future exploration of demonstration quality and interaction design for feedback collection.
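For orientation, the two objectives named above can be written in their commonly used textbook forms. This is a sketch under assumptions rather than the paper's exact notation: the symbols $\alpha$ (KL weight), $\beta$ (DPO temperature), $p(x)$ (prompt distribution), and $y^w, y^l$ (preferred and dispreferred completions) are chosen here for illustration, and in DITTO's setting $y^w$ would be a user demonstration while $y^l$ would be a model sample.

```latex
% KL-constrained reward maximization (standard form; notation assumed)
\max_{\pi} \;
  \mathbb{E}_{x \sim p,\; y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \alpha \, D_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% DPO loss on a comparison set D of triples (x, y^w, y^l)
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  - \mathbb{E}_{(x,\, y^w,\, y^l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y^w \mid x)}{\pi_{\mathrm{ref}}(y^w \mid x)}
    - \beta \log \frac{\pi_\theta(y^l \mid x)}{\pi_{\mathrm{ref}}(y^l \mid x)}
  \right) \right]
```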

Abstract

Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. Steering LLMs away from generic output is possible through supervised finetuning or RLHF, but requires prohibitively large datasets for new ad-hoc tasks. We argue that it is instead possible to align an LLM to a specific setting by leveraging a very small number (< 10) of demonstrations as feedback. Our method, Demonstration ITerated Task Optimization (DITTO), directly aligns language model outputs to a user's demonstrated behaviors. Derived using ideas from online imitation learning, DITTO cheaply generates online comparison data by treating users' demonstrations as preferred over output from the LLM and its intermediate checkpoints. Concretely, DITTO operates by having an LLM generate examples that are presumed to be inferior to expert demonstrations. The method iteratively constructs pairwise preference relationships between these LLM-generated samples and expert demonstrations, potentially including comparisons between different training checkpoints. These constructed preference pairs are then used to train the model using a preference optimization algorithm (e.g. DPO). We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants (N = 16). Across our benchmarks and user study, we find that win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an avg. of 19% points. By using demonstrations as feedback directly, DITTO offers a novel method for effective customization of LLMs.
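The comparison-construction step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Comparison` and `build_comparisons` and the labels "online", "replay", and "inter_model" are hypothetical, and the sketch assumes that samples from a later checkpoint are preferred over samples from an earlier one when checkpoints are compared against each other. Sampling from the model and the DPO update itself are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Comparison:
    prompt: str
    chosen: str    # completion treated as preferred
    rejected: str  # completion treated as dispreferred
    kind: str      # "online", "replay", or "inter_model" (hypothetical labels)


def build_comparisons(
    prompt: str,
    demos: List[str],
    samples_by_checkpoint: List[List[str]],
) -> List[Comparison]:
    """Build pairwise preferences for one prompt.

    samples_by_checkpoint[t] holds completions sampled from checkpoint t;
    the last entry is assumed to come from the current policy.
    """
    comparisons: List[Comparison] = []
    latest = len(samples_by_checkpoint) - 1

    # Demonstrations are treated as preferred over every model sample,
    # whether it came from the current policy or an earlier checkpoint.
    for t, samples in enumerate(samples_by_checkpoint):
        kind = "online" if t == latest else "replay"
        for demo in demos:
            for sample in samples:
                comparisons.append(Comparison(prompt, demo, sample, kind))

    # Assumption: samples from a later checkpoint are preferred over
    # samples from any earlier checkpoint.
    for later in range(1, latest + 1):
        for earlier in range(later):
            for good in samples_by_checkpoint[later]:
                for bad in samples_by_checkpoint[earlier]:
                    comparisons.append(
                        Comparison(prompt, good, bad, "inter_model")
                    )

    return comparisons


pairs = build_comparisons(
    prompt="Write an email declining a meeting.",
    demos=["demo A", "demo B"],
    samples_by_checkpoint=[["s0a", "s0b"], ["s1a", "s1b"]],
)
```

The resulting list could then be passed to any preference optimization trainer that accepts (prompt, chosen, rejected) triples; how many comparisons of each kind to include in a batch is a design choice this sketch does not address.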

Paper Structure (49 sections, 4 theorems, 16 equations, 9 figures, 7 tables, 1 algorithm)

Introduction
Related Work
DITTO
Notation and Background
DITTO
Generating Comparisons.
From Comparisons to Rankings.
A Practical Algorithm.
Deriving DITTO as Online Imitation Learning
Deriving DITTO.
Why does DITTO work better than SFT alone?
Experiments
Static Benchmarks
Data
Splits and Preprocessing
...and 34 more sections

Key Result

Lemma 3.1

(Adapted from brown2020better) Let $\pi^\star$ be the optimal policy for eq:obj and $\hat{\pi}$ be the policy estimated by DITTO using expert demonstrations $\mathcal{D}_E$. Extrapolation beyond the demonstrator, i.e. $\mathbb{E}_{\hat{\pi}}[r(x,y)] > \mathbb{E}_{\mathcal{D}_E}[r(x,y)]$ is guarantee

Figures (9)

Figure 1: DITTO iteratively aligns LLMs to demonstrated behavior. When a user supplies demonstrations (through edits to a model's output, past preferred interaction history, or writing examples from scratch), DITTO treats these demonstrations as preferred to all model behavior, including earlier iterations of the trained model. Using demonstrations as feedback allows for cheap generation of online comparison data and enables few-shot alignment with just a handful of samples.
Figure 2: Head-to-head win rates across DITTO hyperparameter perturbations on CMCC. First, increasing the number of DITTO iterations improves GPT-4 eval performance (left). Increasing the number of generated negatives also reduces DITTO variance across users while improving DITTO performance (middle). Finally, increasing demos also improves performance, but we observe diminishing returns (right). Error bars correspond to standard error of the mean across authors.
Figure 3: Demonstrations are more sample efficient than pairwise preferences for an individual user. We compared DITTO with 4 demos to pairwise prefs sampled from (1) base instruction-following LM $\pi_\textrm{ref}$ and (2) $\pi_\textrm{ref}$ fine-tuned on demos. Applying DPO on 500 pairwise preferences---with samples from $\pi_\textrm{ref}$---yields no improvement compared to DITTO. Even if demos are used to fine-tune $\pi_\textrm{ref}$ before sampling, one must collect many pairwise preferences to approach DITTO.
Figure 4: Few-shot prompt used to generate outputs for few-shot examples. We additionally test ablations in red text, but find that this reduces win rates for few-shot methods by 4% pts.
Figure 5: Ablations for the number of demonstrations in few-shot prompted GPT-4. We report win-rate vs. DITTO for a varying number of demonstrations in the few-shot prompt. While increasing the number of demonstrations in the prompt is positively correlated with improved performance, win rates are well under 50% and improvements are non-monotonic, with notable variance as we continue adding demonstrations.
...and 4 more figures

Theorems & Definitions (4)

Lemma 3.1
Proposition B.1
Corollary B.2
Lemma B.3
