OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data

Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake

TL;DR

OpenBezoar demonstrates that small open models can achieve competitive instruction-following performance by combining three open data-generation schemes (LaMini-LM, Evol-Instruct, Orca) with filtering that uses GPT-4 as a human proxy, followed by QLoRA fine-tuning and direct preference optimization (DPO). The authors release the OpenBezoar-SFT, OpenBezoar-HH-RLHF-SFT, and OpenBezoar-HH-RLHF-DPO checkpoints along with their datasets and code, enabling open replication. Evaluations on LM Eval Harness and MT-Bench show consistent gains over the base model and competitive standing among open 3B-scale models, illustrating the practical viability of cost-efficient, open pipelines for instruction tuning. Limitations include dependence on closed models for filtering and evaluation (GPT-4, Claude 2.1) and relatively small synthetic datasets, pointing to future work on larger open data programs and fully open evaluation tools.

Abstract

Instruction fine-tuning pretrained LLMs for diverse downstream tasks has demonstrated remarkable success and has captured the interest of both academics and practitioners. To ensure such fine-tuned LLMs align with human preferences, techniques such as RLHF and DPO have emerged. At the same time, there is increasing interest in models with smaller parameter counts. In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models. In this recipe, we first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model under three schemes based on LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a seed dataset) and Orca (with the Flan Collection as a seed dataset), then filter these generations using GPT-4 as a human proxy. We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme. The resulting checkpoint is further fine-tuned with a subset of the HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to obtain the final checkpoint. Evaluation is done with the LM Eval Harness tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with Claude 2.1. We find that the final checkpoint, "OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at the 3B parameter scale, even outperforming the top model in one of the categories on the Hugging Face Open LLM Leaderboard. We release the "OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT" and "OpenBezoar-HH-RLHF-DPO" checkpoints, alongside our generated datasets, on Hugging Face at https://huggingface.co/collections/SurgeGlobal/open-bezoar-6620a24923e12127e9e2b9cc and our codebase at https://bitbucket.org/paladinanalytics/workspace/projects/OP.
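
The abstract refers to "the DPO loss" without stating it. As a rough illustration only, the following is a minimal sketch of the standard per-example DPO objective, which is the negative log-sigmoid of a scaled difference between the policy-versus-reference log-ratios of the preferred and dispreferred responses. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example and are not taken from the paper or its codebase.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    All names and the default beta are illustrative assumptions,
    not values reported for OpenBezoar.
    """
    # How much more (in log space) the policy likes each response
    # than the frozen reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; larger means the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the preferred response relative to the reference.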

Paper Structure

This paper contains 34 sections, 7 equations, 27 figures, 9 tables.

Figures (27)

  • Figure 1: An example of an instruction generation prompt based on three random examples from databricks-dolly-15k
  • Figure 2: Response generation prompt used
  • Figure 3: An in-depth evolving prompt used to add constraints to a random instruction in databricks-dolly-15k
  • Figure 4: An in-breadth evolving prompt based on a random instruction in databricks-dolly-15k
  • Figure 5: The prompt template used for response generation in the evol-instruct dataset generation process
  • ...and 22 more figures