Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language

Di Jin; Shikib Mehri; Devamanyu Hazarika; Aishwarya Padmakumar; Sungjin Lee; Yang Liu; Mahdi Namazifar

TL;DR

The paper presents Critique-and-Revise (CnR), a data-efficient approach to aligning LLMs with human preferences by training on natural language critiques and revisions. An open-source model fine-tuned on fewer than 1,000 CnR annotations learns to both critique and revise responses, and it improves outputs from strong systems such as ChatGPT and GPT-4 through iterative revision. Across automatic and human evaluations, CnR demonstrates substantial gains (e.g., a 56.6% win rate for revised ChatGPT responses over the originals after one iteration, rising to 65.9% after five iterations). The results also indicate that instruction-tuning quality, data size, and model scale significantly influence revision quality. The work highlights the practicality of data-efficient, interpretable alignment with human feedback and discusses limitations and directions for extending CnR to harder tasks and multi-turn interactions.

Abstract

Learning from human feedback is a prominent technique for aligning the output of large language models (LLMs) with human expectations. Reinforcement learning from human feedback (RLHF) performs this alignment using human preference signals in the form of rankings of response pairs. However, human preferences on LLM outputs can come in much richer forms, including natural language, which may provide detailed feedback on the strengths and weaknesses of a given response. In this work we investigate the data efficiency of modeling human feedback expressed in natural language. Specifically, we fine-tune an open-source LLM, e.g., Falcon-40B-Instruct, on a relatively small amount (1,000 records or even fewer) of human feedback in natural language in the form of critiques and revisions of responses. We show that this model is able to improve the quality of responses from even some of the strongest LLMs, such as ChatGPT, BARD, and Vicuna, through critique and revision of those responses. For instance, after one iteration of revision of ChatGPT responses, the revised responses have a 56.6% win rate over the original ones, and this win rate can be further improved to 65.9% after applying the revision for five iterations.
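The iterative revision described in the abstract can be sketched as a simple loop in which each revised response becomes the input to the next round. This is a minimal illustration only, not the authors' implementation: the `CnRModel` interface, the `iterative_revise` helper, and the `toy_model` stand-in are all hypothetical names introduced here, and the real system would call a fine-tuned LLM instead of the toy function.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not from the paper):
# a CnR model maps (query, response) to (critique, revision).
CnRModel = Callable[[str, str], Tuple[str, str]]


def iterative_revise(
    model: CnRModel, query: str, response: str, iterations: int = 5
) -> List[Tuple[str, str]]:
    """Apply critique-and-revise repeatedly, feeding each revision back in.

    Returns the (critique, revision) pair produced at every iteration.
    """
    history: List[Tuple[str, str]] = []
    current = response
    for _ in range(iterations):
        critique, revision = model(query, current)
        history.append((critique, revision))
        current = revision  # the next round critiques the latest revision
    return history


def toy_model(query: str, response: str) -> Tuple[str, str]:
    """Stand-in for a fine-tuned LLM; it only appends a marker to the response."""
    critique = f"The response to '{query}' could be more specific."
    return critique, response + " [revised]"


if __name__ == "__main__":
    steps = iterative_revise(toy_model, "What is RLHF?", "A training method.", iterations=2)
    for i, (critique, revision) in enumerate(steps, start=1):
        print(i, critique, "->", revision)
```

Keeping the critique alongside each revision mirrors the interpretability point made in the TL;DR: every change to the response comes with a natural language explanation of why it was made.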

Paper Structure (29 sections, 7 figures, 6 tables)

Introduction
Related Work
Alignment via Preference Modeling.
Alignment via Natural Language Feedback.
Is Critique Required in CnR?
Multi-Model Feedback Providers.
General Purpose Feedback in Natural Language.
Training vs. Few-Shot for Natural Language Feedback Setup.
CnR Data
CnR Model
Experiments
Results
Best CnR Setting
CnR Can Improve ChatGPT
CnR Can Improve Iteratively
...and 14 more sections

Figures (7)

Figure 1: Example of critique and revision annotation
Figure 2: Different CnR data preparation settings
Figure 3: Win rates between the original responses from Vicuna-13B, BARD, ChatGPT, and GPT4 on FastChat 80 queries and the revised responses by Falcon-I-CnR.
Figure 4: Win rates of revised responses by Falcon-40B-Instruct-CnR with respect to the original responses from ChatGPT over iterative revisions.
Figure 5: Win rate of the revised responses over the original ones for different base models fine-tuned on different numbers of CnR samples. Original responses are from instruction-tuned model X (such as GPTJ-Dolly in the left-most figure), while revised responses are from the corresponding revision model X-CnR, which fine-tunes model X on the CnR data (e.g., GPTJ-Dolly-CnR).
...and 2 more figures
