Self-Directed Synthetic Dialogues and Revisions Technical Report

Nathan Lambert; Hailey Schoelkopf; Aaron Gokaslan; Luca Soldaini; Valentina Pyatkin; Louis Castricato

TL;DR

The paper addresses the scarcity of open, multi-turn synthetic data for fine-tuning language models by introducing Self-Directed Synthetic Dialogues (SDSD), a procedurally generated dialogue dataset, organized around topics and principles, created with open models (DBRX, Llama 2 70B, Mistral Large). SDSD uses an initial plan as a system prompt to guide self-chat between two instances of the same model, and employs a critique–revision loop to produce an SDSD-Revisions (SDSD-R) dataset that encodes preference data. Building on RLHF and Constitutional AI, the authors detail the data-creation workflow, the model prompts, and an evaluation of dialogue properties (e.g., turn length, principle violations). The work demonstrates the feasibility of generating long-form synthetic data with open models and discusses practical considerations, limitations, and future directions for using SDSD and SDSD-R in open-model fine-tuning and alignment research.

Abstract

Synthetic data has become an important tool in the fine-tuning of language models to follow instructions and solve complex problems. Nevertheless, most open data to date lacks multi-turn examples and was collected from closed models, limiting progress on open fine-tuning methods. We introduce Self Directed Synthetic Dialogues (SDSD), an experimental dataset consisting of guided conversations of language models talking to themselves. The dataset consists of multi-turn conversations generated with DBRX, Llama 2 70B, and Mistral Large, all instructed to follow a conversation plan generated prior to the conversation. We also explore including principles from Constitutional AI and other related works to create synthetic preference data via revisions to the final conversation turn. We hope this work encourages further exploration of multi-turn data and the use of open models for expanding the impact of synthetic data.

Paper Structure (27 sections, 1 equation, 6 figures, 6 tables)

Introduction
Related work
Background
Reinforcement Learning from Human Feedback
Preference data collection
Reward model training
Optimizing the policy with RL
Constitutional AI
Self Directed Synthetic Dialogues (SDSD)
Dataset Creation
Dataset Analysis
Limitations and Lessons for Synthetic Data
Conclusion
Additional Results
SDSD prompts
...and 12 more sections

Figures (6)

Figure 1: An overview of the data generation process with Self Directed Synthetic Dialogues. First, Topics, Principles, and Goals are collected or generated. Next, the language model follows the plan for the conversation, acting as both sides of the dialogue with the same system prompt. The conversation continues until the plan is completed or a violation occurs, yielding more dialogues than revisions. The generating model notices the violation in text, and generates tokens indicating it has done so. When a violation of the principle occurs, a critique is used to instruct the language model on how to re-write the final answer into a preference pair of corrected-response and original response.
Figure 2: Sample user-system interaction generated with https://huggingface.co/NousResearch/Nous-Hermes-Llama2-70b. In this example, the model makes a minor mistake with the plan, erroneously copying the goal into the plan, but still arrives at a reasonable topic and executes the conversation. Additional random examples are included in Appendix \ref{app:results}.
Figure 3: The distribution of how often a given principle was violated in the dataset. In this figure and related tables, the count of violations can exceed the count of data points because some conversations violate multiple principles. For examples of the top- and bottom-violated principles per model, see Tab. \ref{tab:principles_comparison} and Tab. \ref{tab:principles_comparison_bottom}, respectively.
Figure 4: Random example from the DBRX split.
Figure 5: Random example from the Llama split.
...and 1 more figures
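
The generation process described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the marker strings, the `ChatFn` signature, the function names, and the stub model are all hypothetical, and a real pipeline would call an actual language model in place of `stub_chat`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Assumed markers: the caption says the model emits tokens when it notices a
# violation, but the exact strings are not given here, so these are invented.
VIOLATION_MARKER = "[VIOLATION]"
DONE_MARKER = "[DONE]"

Message = Dict[str, str]
# Hypothetical model interface: (system_prompt, history) -> next turn text.
ChatFn = Callable[[str, List[Message]], str]


@dataclass
class Dialogue:
    plan: str
    turns: List[Message] = field(default_factory=list)
    violated: bool = False


def self_chat(chat: ChatFn, plan: str, max_turns: int = 8) -> Dialogue:
    """Let one model play both sides, using the plan as the shared system prompt."""
    dialogue = Dialogue(plan=plan)
    for i in range(max_turns):
        role = "user" if i % 2 == 0 else "assistant"
        text = chat(plan, dialogue.turns)
        dialogue.turns.append({"role": role, "content": text})
        if VIOLATION_MARKER in text:  # the model flagged a principle violation
            dialogue.violated = True
            break
        if DONE_MARKER in text:  # the plan was completed without a violation
            break
    return dialogue


def revise(chat: ChatFn, dialogue: Dialogue, critique: str) -> Optional[Dict[str, str]]:
    """Turn a violating final turn into a (chosen, rejected) preference pair."""
    if not dialogue.violated:
        return None  # only violating dialogues yield a revision
    rejected = dialogue.turns[-1]["content"]
    # The critique serves as the instruction for rewriting the final turn.
    chosen = chat(critique, dialogue.turns)
    return {"chosen": chosen, "rejected": rejected}


def stub_chat(system_prompt: str, history: List[Message]) -> str:
    """Deterministic stand-in for a language model, for illustration only."""
    if system_prompt.startswith("CRITIQUE"):
        return "revised reply"
    if len(history) == 3:
        return "bad reply " + VIOLATION_MARKER
    return f"turn {len(history)}"


if __name__ == "__main__":
    d = self_chat(stub_chat, "PLAN: discuss a topic under a principle")
    print(len(d.turns), d.violated)
    print(revise(stub_chat, d, "CRITIQUE: rewrite the last turn to follow the principle"))
```

Because only dialogues that end in a flagged violation produce a pair, a sketch like this naturally yields more dialogues than revisions, which matches the caption.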
