Suri: Multi-constraint Instruction Following for Long-form Text Generation

Chau Minh Pham, Simeng Sun, Mohit Iyyer

TL;DR

This work proposes Instructional ORPO (I-ORPO), an alignment method based on the ORPO algorithm that obtains negative feedback from synthetically corrupted instructions generated by an LLM.

Abstract

Existing research on instruction following largely focuses on tasks with simple instructions and short responses. In this work, we explore multi-constraint instruction following for generating long-form text. We create Suri, a dataset with 20K human-written long-form texts paired with LLM-generated backtranslated instructions that contain multiple complex constraints. Because of prohibitive challenges associated with collecting human preference judgments on long-form texts, preference-tuning algorithms such as DPO are infeasible in our setting; thus, we propose Instructional ORPO (I-ORPO), an alignment method based on the ORPO algorithm. Instead of receiving negative feedback from dispreferred responses, I-ORPO obtains negative feedback from synthetically corrupted instructions generated by an LLM. Using Suri, we perform supervised and I-ORPO fine-tuning on Mistral-7B-Instruct-v0.2. The resulting models, Suri-SFT and Suri-I-ORPO, generate significantly longer texts (~5K tokens) than base models without significant quality deterioration. Our human evaluation shows that while both SFT and I-ORPO models satisfy most constraints, Suri-I-ORPO generations are generally preferred for their coherent and informative incorporation of the constraints. We release our code at https://github.com/chtmp223/suri.
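
To make the idea of "negative feedback from corrupted instructions" concrete, the following is a minimal sketch of what an I-ORPO-style objective could look like. It assumes the standard ORPO form (a negative log-likelihood term plus a weighted log-odds-ratio term) with the dispreferred response replaced by the same gold response conditioned on the corrupted instruction. The function names, the use of length-averaged log probabilities, and the weight `lam=0.1` are illustrative assumptions, not values taken from the paper or its released code.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) given log p, for p strictly between 0 and 1."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def i_orpo_loss(logp_y_given_xw: float, logp_y_given_xl: float, lam: float = 0.1) -> float:
    """Sketch of an I-ORPO-style loss for one example (assumed form).

    logp_y_given_xw: average token log-prob of the gold response y given the
                     correct (backtranslated) instruction x_w.
    logp_y_given_xl: average token log-prob of the same y given the
                     corrupted instruction x_l.
    lam:             hypothetical weight on the odds-ratio term.
    """
    # Supervised term: maximize likelihood of y under the correct instruction.
    nll = -logp_y_given_xw
    # Odds-ratio term: prefer y under x_w over y under x_l.
    z = log_odds(logp_y_given_xw) - log_odds(logp_y_given_xl)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    or_term = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return nll + lam * or_term
```

In this sketch the response is held fixed and only the instruction changes, so the odds-ratio penalty shrinks as the model assigns a lower probability to the gold text under the corrupted instruction than under the correct one.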

Paper Structure (45 sections, 7 equations, 7 figures, 20 tables)

Figures (7)

  • Figure 1: Our work consists of two stages. First, we construct the Suri dataset using gold responses sampled from three existing datasets that include creative writing and open web text, along with backtranslated instruction $x_w$ and corrupted instruction $x_l$. Second, we fine-tune Mistral-7B-Instruct-v0.2 on Suri, resulting in two variations: Suri-I-ORPO (via I-ORPO) and Suri-SFT (via supervised fine-tuning).
  • Figure 2: Average percentage of different constraint types within each instruction. The left panel categorizes constraints by content, and the right panel categorizes them by scope.
  • Figure 3: ORPO training curve. The left panel shows the log probability of the chosen and rejected prompts over 3 epochs. The right panel shows the log probability of the response given the chosen and rejected prompts over 3 epochs. $\text{logps}(y|x_w)$ and $\text{logps}(y|x_l)$ diverge after 0.5 training epochs.
  • Figure 4: Average number of tokens in generations from baseline open-source models (Llama-3-8B-Instruct, Mixtral-8x7B-Instruct-v0.1, Mistral-7B-Instruct-v0.2) and our fine-tuned models (Suri-I-ORPO, Suri-SFT).
  • Figure 5: Average percentage of 5-gram repetitions before and after 2,048 tokens in each generation from I-ORPO and SFT models.
  • ...and 2 more figures
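
The repetition statistic named in the Figure 5 caption can be illustrated with a short sketch. The paper's exact definition is not given here, so this assumes one common choice: the percentage of 5-gram occurrences that duplicate an earlier 5-gram in the same segment, computed separately for the tokens before and after position 2,048. The function names and the segment-local counting are assumptions for illustration.

```python
from typing import Sequence, Tuple


def repeated_ngram_pct(tokens: Sequence, n: int = 5) -> float:
    """Percentage of n-gram occurrences that repeat an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    seen = set()
    repeats = 0
    for gram in ngrams:
        if gram in seen:
            repeats += 1  # this occurrence duplicates an earlier one
        else:
            seen.add(gram)
    return 100.0 * repeats / len(ngrams)


def repetition_before_after(tokens: Sequence, split: int = 2048, n: int = 5) -> Tuple[float, float]:
    """Repetition percentage for tokens[:split] and tokens[split:] (assumed split rule)."""
    return repeated_ngram_pct(tokens[:split], n), repeated_ngram_pct(tokens[split:], n)
```

Counting each segment independently means that a 5-gram first seen before the split and repeated after it is not counted as a repeat; a variant that carries the `seen` set across the split would measure that case instead.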