Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning

Ximing Lu; Faeze Brahman; Peter West; Jaehun Jang; Khyathi Chandu; Abhilasha Ravichander; Lianhui Qin; Prithviraj Ammanabrolu; Liwei Jiang; Sahana Ramnath; Nouha Dziri; Jillian Fisher; Bill Yuchen Lin; Skyler Hallinan; Xiang Ren; Sean Welleck; Yejin Choi

Abstract

While extreme-scale language models have demonstrated exceptional performance on a variety of language tasks, the degree of control over these language models through pure prompting can often be limited. Directly fine-tuning such language models can be effective for tailoring them, but it can be either extremely costly (e.g., GPT-3) or not even feasible for the broader community (e.g., GPT-4). We propose Inference-time Policy Adapters (IPA), which efficiently tailors a language model such as GPT-3 without fine-tuning it. IPA guides a large base model during decoding time through a lightweight policy adapter trained to optimize an arbitrary user objective with reinforcement learning. On five challenging text generation tasks, such as toxicity reduction and lexically constrained generation, IPA consistently brings significant improvements over off-the-shelf language models. It outperforms competitive baseline methods, sometimes even including expensive fine-tuning. In particular, tailoring GPT-2 with IPA can outperform GPT-3, while tailoring GPT-3 with IPA brings a major performance boost over GPT-3 (and sometimes even over GPT-4). Our promising results highlight the potential of IPA as a lightweight alternative to tailoring extreme-scale language models.
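
The decoding-time guidance described in the abstract can be illustrated with a small sketch. The abstract only says that the adapter "guides" the base model, so the combination rule here is an assumption: the two next-token distributions are multiplied and renormalized (a product of experts). The function names `log_softmax` and `combine_policies` and the toy logits are hypothetical and are not taken from the paper's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def combine_policies(base_logits, adapter_logits):
    """Assumed combination rule (product of experts): add the two
    log-probability vectors and renormalize. The base model is only
    queried for its next-token scores; it is never updated."""
    base_lp = log_softmax(base_logits)
    adapter_lp = log_softmax(adapter_logits)
    combined_lp = log_softmax([b + a for b, a in zip(base_lp, adapter_lp)])
    return [math.exp(lp) for lp in combined_lp]


# Toy vocabulary of three tokens (illustrative values only).
base_logits = [2.0, 1.0, 0.0]      # frozen base model prefers token 0
adapter_logits = [-3.0, 0.0, 0.0]  # small adapter discourages token 0

base_probs = [math.exp(lp) for lp in log_softmax(base_logits)]
tailored_probs = combine_policies(base_logits, adapter_logits)
```

In this toy case the adapter shifts probability mass away from token 0 without touching the base model's parameters. In the setting the abstract describes, the adapter's own parameters would be trained with reinforcement learning against the user objective, which this sketch does not cover.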

Paper Structure (58 sections, 5 equations, 6 figures, 20 tables)

Introduction
Background
Problem Setting
Preliminary: Tailoring LMs with RL
Inference-time Policy Adapters (IPA)
Policy Adaptation
Adapter Training with RL
Approximate Policy.
IPA at Inference Time.
Experiments
Toxicity Reduction
Datasets and Metrics.
Setup and Baselines
Results
Lexically Constrained Generation
...and 43 more sections

Figures (6)

Figure 1: Inference-time Policy Adapters (IPA) efficiently steer a large-scale language model (such as GPT-3) at decoding time through a lightweight policy adapter trained to optimize an arbitrary user objective with reinforcement learning.
Figure 2: Performance of $\text{IPA}^{\text{-}}$ (blue line) with respect to the size of the adapter model (distill-GPT2, GPT2-small, GPT2-medium, GPT2-large, GPT2-XL) on top of an off-the-shelf GPT-3 as the base policy. The grey line denotes the performance of the off-the-shelf GPT-3.
Figure 3: Pairwise human evaluation in terms of overall quality for Open-ended Generation on XSum with off-the-shelf GPT2-XL (top) and GPT-3 (bottom) as the base policy to tailor.
Figure 4: Human evaluation layout on Amazon Mechanical Turk for Dialogue Safety Control
Figure 5: Human evaluation layout on Amazon Mechanical Turk for open-ended generation
...and 1 more figures

Theorems & Definitions (2)

Definition 1: Tailored policy
Definition 2: Approximate policy
