Towards Aligning Language Models with Textual Feedback

Saüc Abadal Lloret; Shehzaad Dhuliawala; Keerthiram Murugesan; Mrinmaya Sachan

TL;DR

ALT reframes alignment as a conditional sequence modeling problem in which textual feedback guides generation. It collects samples annotated with natural-language feedback and trains the model to maximize the conditional likelihood $\log p_\theta(y|x,f)$, optimizing $L_\theta = L_{NLL} + \beta L_{ref} + \alpha L_H$ to balance alignment and policy stability. Across toxicity reduction, summarization, and dialog, ALT achieves stronger or more data-efficient performance than PPO and standard baselines, including a $62\%$ toxicity reduction and PPO-level summarization performance with roughly $20\%$ of the training data. Feedback provided by an LLM also steers the model effectively. These results suggest that natural-language feedback can provide a richer learning signal for alignment and could enable scalable, user-friendly alignment with minimal hyperparameter tuning.
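To make the objective in the TL;DR concrete, the following is a minimal pure-Python sketch of a loss with the shape $L_{NLL} + \beta L_{ref} + \alpha L_H$. The summary does not define the individual terms, so their forms here are assumptions: $L_{ref}$ is taken to be a per-token KL divergence from a reference policy and $L_H$ a negative-entropy regularizer. The function name `alt_style_loss` and the default values of `beta` and `alpha` are placeholders, not values from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def alt_style_loss(policy_logits, ref_logits, targets, beta=0.05, alpha=0.05):
    """Average per-token L_NLL + beta * L_ref + alpha * L_H.

    policy_logits / ref_logits hold one list of vocabulary logits per target
    token, assumed to be computed with the prompt x and feedback f in context.
    Assumed forms (not confirmed by the summary):
      L_ref = KL(policy || reference), L_H = -entropy(policy).
    """
    nll = kl = neg_entropy = 0.0
    for p_logits, r_logits, y in zip(policy_logits, ref_logits, targets):
        logp = log_softmax(p_logits)
        logr = log_softmax(r_logits)
        probs = [math.exp(v) for v in logp]
        nll += -logp[y]  # -log p_theta(y_t | x, f, y_<t)
        kl += sum(p * (lp - lr) for p, lp, lr in zip(probs, logp, logr))
        neg_entropy += sum(p * lp for p, lp in zip(probs, logp))
    n = len(targets)
    nll, kl, neg_entropy = nll / n, kl / n, neg_entropy / n
    return {"nll": nll, "ref": kl, "H": neg_entropy,
            "total": nll + beta * kl + alpha * neg_entropy}
```

In this sketch the reference term is zero whenever the policy and reference logits coincide, so the penalty only grows as the fine-tuned policy drifts from its starting point.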

Abstract

We present ALT (ALignment with Textual feedback), an approach that aligns language models with user preferences expressed in text. We argue that text offers greater expressiveness, enabling users to provide richer feedback than simple comparative preferences, and that this richer feedback can lead to more efficient and effective alignment. ALT aligns the model by conditioning its generation on the textual feedback. Our method relies solely on language modeling techniques and requires minimal hyper-parameter tuning, yet it retains the main benefits of RL-based alignment algorithms and can effectively learn from textual feedback. We explore the efficacy and efficiency of textual feedback across different tasks such as toxicity reduction, summarization, and dialog response generation. We find that ALT outperforms PPO on toxicity reduction while matching its performance on summarization with only 20% of the samples. We also explore how ALT can be used with feedback provided by an existing LLM, considering both constrained and unconstrained textual feedback. Finally, we outline future directions for aligning models with natural language feedback.
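As an illustration of what conditioning generation on textual feedback can look like in practice, here is a small sketch that builds conditioned inputs for training and for inference. The template (`feedback: ... input: ... output:`), the helper names, and the example strings are all hypothetical; the abstract does not specify how the feedback is combined with the prompt.

```python
def build_conditioned_prompt(prompt, feedback):
    """Prepend textual feedback to the prompt (hypothetical template)."""
    return f"feedback: {feedback.strip()}\ninput: {prompt.strip()}\noutput:"

def build_training_example(prompt, response, feedback):
    """Pair the conditioned input with the sampled response as the target."""
    return {
        "input": build_conditioned_prompt(prompt, feedback),
        "target": " " + response.strip(),
    }

# Training: each sampled response is paired with the feedback it received.
example = build_training_example(
    "Summarize: The meeting was moved to Friday.",
    "The meeting is now on Friday.",
    "Accurate and concise.",
)

# Inference: condition on feedback that describes the desired behavior.
steered = build_conditioned_prompt(
    "Summarize: The launch slipped by two weeks.",
    "Accurate and concise.",
)
```

Under this framing, steering at inference time amounts to choosing the feedback string, with no change to the decoding procedure itself.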

Paper Structure (38 sections, 4 equations, 11 figures, 7 tables, 1 algorithm)

Introduction
ALT: ALignment with Textual feedback
Data Collection: Sampling + Feedback
Training
Feedback Provider
Reward Model Feedback
LLM-based Categorical Feedback
LLM-based Unconstrained Feedback
Exemplar Feedback
Tasks
Toxicity Reduction
Experimental details.
Summarization
Experimental details.
Dialog Response Generation
...and 23 more sections

Figures (11)

Figure 1: A basic schematic for ALT. Steps 1) Sampling and 2) Textual Feedback encompass the Data collection phase, in which we sample multiple generations from the LLM policy and annotate the samples with textual feedback; and Step 3) Conditional Supervised Fine-Tuning refers to the Training phase, in which we fine-tune the current LLM policy on the collected data using the training loss (\ref{eq:loss}). The 3 steps are repeated for a total of N iterations. In the first iteration, we sample from a reference initial policy without conditioning on any feedback. In subsequent iterations, we sample from the previously fine-tuned policy conditioned on specific exemplar feedback that represents the desired behavior to which we want to steer our policy.
Figure 2: Training curves for Quark and $\textsc{ALT}_{\textsc{RM}}$. Evaluation on the validation set. $\textsc{ALT}_{\textsc{RM}}$ achieves a higher reward model score than Quark and also learns much faster. Each iteration corresponds to 2k training samples.
Figure 3: Training curves for $\textsc{ALT}_{\textsc{LMC}}$ on HH. The percentage of Harmless and very helpful generations increases while the percentage of Harmful generations decreases. Each iteration corresponds to 2k training samples.
Figure 4: Evaluation metrics on the unlearning toxicity experiment as the training of $\textsc{ALT}_{\textsc{RM}}$ progresses.
Figure 5: Training curves showing generation length (left axis) and the % of truncated generations (right axis) for $\textsc{ALT}_{\textsc{LMC}}$ on HH. Evaluation on a held-out validation set. Chosen refers to the human-preferred responses on the HH-RLHF dataset. $\textsc{ALT}_{\textsc{LMC}}$ stays the closest to the SFT model in generation length (avg. $\sim$ 120 tokens), followed by SteerLM (avg. $\sim$ 160 tokens) and DPO (avg. $\sim$ 240 tokens). Regarding the % of truncated generations, both $\textsc{ALT}_{\textsc{LMC}}$ and SteerLM follow a similar trend, with around half as many truncated generations as SFT ($\sim$ 10%), whereas over 70% of DPO generations are truncated.
...and 6 more figures
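The three-step loop described in the Figure 1 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `alt_loop` and its callable arguments (`sample`, `annotate`, `fine_tune`) are hypothetical names, the "policy" is a lookup table rather than a language model, and fine-tuning on only the current iteration's data is an assumption the caption does not confirm.

```python
import random

def alt_loop(policy, prompts, sample, annotate, fine_tune,
             exemplar_feedback, n_iterations=3, samples_per_prompt=2):
    """Repeat: 1) sampling, 2) textual feedback, 3) conditional fine-tuning."""
    for iteration in range(n_iterations):
        # First iteration: sample from the initial policy without feedback.
        # Later iterations: condition on the exemplar feedback.
        condition = None if iteration == 0 else exemplar_feedback
        dataset = []
        for x in prompts:
            for _ in range(samples_per_prompt):
                y = sample(policy, x, condition)   # step 1
                f = annotate(x, y)                 # step 2
                dataset.append((x, f, y))
        policy = fine_tune(policy, dataset)        # step 3
    return policy

# Toy stand-ins (all hypothetical): the policy maps feedback -> seen responses.
def toy_sample(policy, x, condition):
    seen = policy.get(condition)
    return random.choice(seen) if seen else random.choice(["thanks!", "go away"])

def toy_annotate(x, y):
    return "polite" if y == "thanks!" else "rude"

def toy_fine_tune(policy, dataset):
    updated = {k: list(v) for k, v in policy.items()}
    for x, f, y in dataset:
        updated.setdefault(f, []).append(y)
    return updated

random.seed(0)
final = alt_loop({}, ["hi", "hello"], toy_sample, toy_annotate, toy_fine_tune,
                 exemplar_feedback="polite")
```

Passing the three steps in as callables keeps the loop independent of any particular feedback provider, so a reward model, an LLM judge, or a human annotator could fill the `annotate` role in this sketch.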
