Teaching Models to Improve on Tape

Liat Bezalel, Eyal Orgad, Amir Globerson

TL;DR

This work introduces an RL framework for teaching models to use corrective feedback, by simulating interaction sessions and rewarding the model according to its ability to satisfy the given constraints.

Abstract

Large Language Models (LLMs) often struggle when prompted to generate content under specific constraints. However, in such cases it is often easy to check whether these constraints are satisfied or violated. Recent works have shown that LLMs can benefit from such "corrective feedback". Here we claim that this skill of LLMs can be significantly enhanced via training. We introduce an RL framework for teaching models to use such rewards, by simulating interaction sessions, and rewarding the model according to its ability to satisfy the constraints. We refer to our method as CORGI (Controlled Generation with RL for Guided Interaction), and evaluate it on a variety of controlled generation tasks using unlabeled training data. We find that CORGI consistently outperforms the baseline reinforcement learning method that does not incorporate conversational feedback. Furthermore, CORGI's interactive framework enables meta-learning, allowing the LLM to generalize better to guided interaction in new tasks. Our results clearly show that conversational optimization, when combined with reinforcement learning, significantly improves the effectiveness of LLMs in controlled generation contexts.

Paper Structure

This paper contains 39 sections, 2 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: An example of the CORGI setup. We consider a dialogue between a generator and a critique. Here the generator is tasked with completing a given sentence in precisely four words, with the final word being "first". The critique evaluates the responses of the generator, providing both feedback and a score (illustrated as a star rating). The LLM receives a reward based on the highest score assigned by the critique throughout the dialogue history. To prioritize improvement on the more challenging length constraint, we set the constraint weights to 80% for length and 20% for the last word constraint.
  • Figure 2: Meta-learning examples. The figure shows responses of three models on two prompts: one for the Panagram task and one for the Clustering task. The three models are Vanilla Llama, RL-NoFB and CORGI. The latter two are trained on multiple source-tasks jointly. It can be seen that in these instances CORGI arrives at a correct output after several iterations, whereas the other models get stuck repeating a suboptimal solution.
  • Figure 3: Multi-task Performance on the Llama-3-Specific Target Tasks. The results show that CORGI significantly benefits from transfer learning, outperforming both the Vanilla-Llama and RL-NoFB configurations.
  • Figure 5: Ablation experiment studying the use of binary feedback. The figure shows a Vanilla Llama model and a CORGI model that use only binary feedback in interaction (green). Also shown in blue is the CORGI model that uses full feedback.
  • Figure 6: Llama-2 results on source tasks, when split by task, show that both RL-NoFB and CORGI training outperform the Vanilla-Llama baseline. Notably, CORGI training consistently surpasses both the RL-NoFB and Vanilla-Llama baselines across nearly all tasks in both single-task and multi-task training settings.
  • ...and 10 more figures
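
The reward described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `score_response` and `dialogue_reward` are hypothetical, each constraint is checked as a simple pass/fail, and the per-turn score is taken to be the weighted sum of satisfied constraints. Only the 80%/20% weights and the "highest score across the dialogue" rule come from the caption.

```python
from typing import List

# Weights from the Figure 1 caption: 80% for length, 20% for the last word.
WEIGHTS = {"length": 0.8, "last_word": 0.2}


def score_response(response: str, n_words: int, last_word: str) -> float:
    """Score one generator turn as the weighted sum of satisfied constraints.

    Assumption: each constraint is binary (satisfied or not). The critique
    in the paper may score responses differently.
    """
    # Drop trailing sentence punctuation so "first." still matches "first".
    words = response.strip().rstrip(".!?").split()
    satisfied = {
        "length": len(words) == n_words,
        "last_word": bool(words) and words[-1].lower() == last_word.lower(),
    }
    return sum(WEIGHTS[name] for name, ok in satisfied.items() if ok)


def dialogue_reward(responses: List[str], n_words: int, last_word: str) -> float:
    """Reward for a whole session: the highest score over all turns."""
    return max(
        (score_response(r, n_words, last_word) for r in responses),
        default=0.0,  # an empty dialogue earns no reward (assumption)
    )


if __name__ == "__main__":
    # Hypothetical session for the task "four words, ending in 'first'".
    session = [
        "to cross the line first",   # five words, correct last word
        "he ran very fast",          # four words, wrong last word
        "crossing the line first",   # both constraints satisfied
    ]
    for turn in session:
        print(turn, "->", score_response(turn, 4, "first"))
    print("reward:", dialogue_reward(session, 4, "first"))
```

Taking the maximum over turns means a session is rewarded for reaching a good response at any point, so a later regression does not erase an earlier success.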