Safer-Instruct: Aligning Language Models with Automated Preference Data

Taiwei Shi; Kai Chen; Jieyu Zhao

TL;DR

The paper addresses the high cost and limited diversity of human-annotated RLHF preference data by introducing Safer-Instruct, a bottom-up pipeline that combines reversed instruction tuning, instruction induction, automatic filtering, and expert-response generation to produce large-scale preference data without human annotators. As a case study, the authors apply the pipeline to safety, yielding a 10k-sample Safer-Instruct dataset, and show that finetuning Alpaca on this synthetic data improves harmlessness while maintaining competitive performance on helpfulness and downstream tasks. The approach reduces reliance on human annotation and can be adapted to other domains; the authors release the data and code to the community.

Abstract

Reinforcement learning from human feedback (RLHF) is a vital strategy for enhancing model capability in language models. However, annotating preference data for RLHF is a resource-intensive and creativity-demanding process, while existing automatic generation methods face limitations in data diversity and quality. In response, we present Safer-Instruct, a novel pipeline for automatically constructing large-scale preference data. Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data without human annotators. To verify the effectiveness of Safer-Instruct, we apply the pipeline to construct a safety preference dataset as a case study. Finetuning an Alpaca model on this synthetic dataset not only demonstrates improved harmlessness but also outperforms models fine-tuned on human-annotated safety preference data, all the while maintaining a competitive edge in downstream tasks. Importantly, our Safer-Instruct framework is versatile and can be applied to generate preference data across various domains, extending its utility beyond safety preferences. It addresses the challenges in preference data acquisition and advances the development of more capable and responsible AI systems. For dataset and code implementation, see https://github.com/uscnlp-lime/safer-instruct
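
The stages named above (instruction induction, automatic filtering, expert-response generation) can be pictured as a simple data flow. The sketch below is illustrative only and is not the authors' implementation: the function names, the `PreferencePair` record, and the toy stand-ins for the models are all hypothetical, and the choice to treat the original text as the dispreferred response and the expert output as the preferred one is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PreferencePair:
    """One preference sample (hypothetical record layout)."""
    instruction: str
    preferred: str     # assumed: response produced by the expert model
    dispreferred: str  # assumed: raw text the instruction was induced from


def build_preference_data(
    texts: Iterable[str],
    induce_instruction: Callable[[str], str],
    keep_instruction: Callable[[str], bool],
    expert_response: Callable[[str], str],
) -> List[PreferencePair]:
    """Turn raw texts into preference pairs using three pluggable stages."""
    pairs = []
    for text in texts:
        # Stage 1: induce an instruction that could have elicited the text.
        instruction = induce_instruction(text)
        # Stage 2: drop instructions the filter rejects.
        if not keep_instruction(instruction):
            continue
        # Stage 3: ask the expert model for the preferred response.
        pairs.append(PreferencePair(instruction, expert_response(instruction), text))
    return pairs


# Toy stand-ins so the sketch runs without any model; real stages would
# call a reversed-instruction-tuned model and an expert model instead.
def toy_induce(text: str) -> str:
    return f"Write a message that says: {text}"


def toy_filter(instruction: str) -> bool:
    return len(instruction.split()) >= 8


def toy_expert(instruction: str) -> str:
    return "I can't help with that request."


pairs = build_preference_data(
    ["short", "a much longer piece of raw text"],
    toy_induce,
    toy_filter,
    toy_expert,
)
for pair in pairs:
    print(pair)
```

Passing the stages in as callables keeps the flow independent of any particular model, which loosely mirrors the claim that the framework can be applied to domains other than safety.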
