LAB: Large-Scale Alignment for ChatBots

Shivchander Sudalairaj, Abhishek Bhandwaldar, Aldo Pareja, Kai Xu, David D. Cox, Akash Srivastava

TL;DR

This work addresses the scalability bottleneck of aligning large language models by removing dependence on costly human annotation and GPT-4-derived data. It proposes LAB, a framework that combines taxonomy-guided synthetic data generation with a two-phase, replay-buffered instruction-tuning regime to expand capabilities while avoiding catastrophic forgetting. Empirical results on open-base models like Llama-2-13b and Mistral-7B show LAB achieving competitive, and in some cases superior, performance across MT-Bench, MMLU, and related benchmarks compared to human- and GPT-4–generated baselines. The approach demonstrates a scalable, cost-effective path for robust instruction-following and knowledge retention applicable to a broad range of applications.

Abstract

This work introduces LAB (Large-scale Alignment for chatBots), a novel methodology designed to overcome the scalability challenges in the instruction-tuning phase of large language model (LLM) training. Leveraging a taxonomy-guided synthetic data generation process and a multi-phase tuning framework, LAB significantly reduces reliance on expensive human annotations and proprietary models like GPT-4. We demonstrate that LAB-trained models can achieve competitive performance across several benchmarks compared to models trained with traditional human-annotated or GPT-4-generated synthetic data. LAB thus offers a scalable, cost-effective solution for enhancing LLM capabilities and instruction-following behaviors without the drawbacks of catastrophic forgetting, marking a step forward in the efficient training of LLMs for a wide range of applications.
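As a rough illustration of the replay-buffered, multi-phase tuning idea, the sketch below mixes a sample of earlier-phase data into the training set of a later phase. It is a minimal sketch under stated assumptions, not the authors' training pipeline: the function name `build_phase_dataset`, the `replay_size` parameter, and the toy string samples are all hypothetical.

```python
import random


def build_phase_dataset(current_phase, replay_buffer, replay_size, seed=0):
    """Mix current-phase samples with samples replayed from earlier phases.

    Hypothetical helper: `replay_size` (how many earlier samples to replay)
    is an assumed knob, not a value taken from the paper.
    """
    rng = random.Random(seed)
    # Never ask for more replayed samples than the buffer holds.
    n_replay = min(replay_size, len(replay_buffer))
    mixed = list(current_phase) + rng.sample(list(replay_buffer), n_replay)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the data of two tuning phases.
phase_one_data = [f"phase1-sample-{i}" for i in range(10)]
phase_two_data = [f"phase2-sample-{i}" for i in range(10)]

# Phase one has nothing to replay yet.
phase_one = build_phase_dataset(phase_one_data, replay_buffer=[], replay_size=0)
# Phase two sees its own data plus a few replayed phase-one samples.
phase_two = build_phase_dataset(phase_two_data, replay_buffer=phase_one, replay_size=3)
```

Replaying earlier samples keeps the model exposed to what it learned in the first phase while it trains on the second, which is the usual motivation for using a replay buffer against catastrophic forgetting.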

Paper Structure (15 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Overview of the LAB alignment method. Starting from the taxonomy root, data are curated in each top-level group, and the examples in the leaf nodes are used by the synthetic data generators to produce orders of magnitude more data for the phased-training step of instruction tuning.
  • Figure 2: Intuition for how taxonomy-driven sampling produces a diverse set of synthetic data and hence improves the data used to train the student model across the task domain. The input panel shows how taxonomy-driven sampling leads to an input distribution with wide support and distinct modes, while self-instruct gives a smooth input distribution. The output panel shows the consequence of using these inputs to generate synthetic data: the teacher model focuses on its own dominant modes if the input is smooth, but covers each task better if the inputs are also concentrated on each task.
  • Figure 3: Instruction Generator prompt template
  • Figure 4: Instruction-response Evaluation template
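The contrast drawn in the Figure 2 caption can be sketched in a few lines: sampling seed examples leaf by leaf guarantees every task in the taxonomy is covered equally, whereas drawing from one pooled set gives no such guarantee. This is a minimal sketch assuming a toy taxonomy; the tree contents, `leaf_nodes`, `taxonomy_sample`, and `pooled_sample` are hypothetical names for illustration, not the paper's taxonomy or generator.

```python
import random
from collections import Counter

# Hypothetical toy taxonomy: nested dicts whose leaves hold seed examples.
TAXONOMY = {
    "writing": {
        "poetry": ["Write a haiku about rain.", "Write a limerick about cats."],
        "email": ["Draft a meeting reminder."],
    },
    "reasoning": {
        "arithmetic": ["What is 17 + 25?", "What is 9 * 8?", "What is 100 / 4?"],
    },
}


def leaf_nodes(tree, path=()):
    """Yield (path, examples) for every leaf list in the nested dict."""
    for name, child in tree.items():
        if isinstance(child, dict):
            yield from leaf_nodes(child, path + (name,))
        else:
            yield path + (name,), child


def taxonomy_sample(tree, per_leaf, seed=0):
    """Draw the same number of seed examples from every leaf node."""
    rng = random.Random(seed)
    return [
        ("/".join(path), rng.choice(examples))
        for path, examples in leaf_nodes(tree)
        for _ in range(per_leaf)
    ]


def pooled_sample(tree, total, seed=0):
    """Draw seed examples from one flat pool, ignoring the taxonomy."""
    rng = random.Random(seed)
    pool = [("/".join(p), ex) for p, exs in leaf_nodes(tree) for ex in exs]
    return [rng.choice(pool) for _ in range(total)]


guided = Counter(leaf for leaf, _ in taxonomy_sample(TAXONOMY, per_leaf=4))
pooled = Counter(leaf for leaf, _ in pooled_sample(TAXONOMY, total=12))
print("taxonomy-guided:", dict(guided))
print("pooled:", dict(pooled))
```

In the guided case every leaf contributes exactly `per_leaf` prompts, so sparsely populated tasks are not crowded out by leaves that happen to hold more seed examples.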