Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models

Yi Luo; Zhenghao Lin; Yuhao Zhang; Jiashuo Sun; Chen Lin; Chengjin Xu; Xiangdong Su; Yelong Shen; Jian Guo; Yeyun Gong

TL;DR

This paper introduces Guide-Align, a guideline-oriented framework that uses a safety-trained model to generate a comprehensive library of input-specific guidelines and a retrieval model to map new inputs to relevant guidelines. By guiding LLMs with retrieved guidelines during inference, and optionally fine-tuning a base model on the resulting aligned outputs to create Labrador, the approach improves both safety and output quality. Evaluations across three benchmarks show substantial gains in alignment and safety, with Labrador (a 13B model) outperforming GPT-3.5-turbo and even surpassing GPT-4 in certain alignment tasks. The work also discusses limitations, including reliance on a safety-trained LLM and uncertain cross-linguistic applicability, and provides a public alignment dataset to foster broader research in value-aligned AI.

Abstract

Large Language Models (LLMs) exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One current alignment technique is principle-driven integration, but it faces challenges arising from the imprecision of manually crafted rules and inadequate risk perception in models without safety training. To address these challenges, we introduce Guide-Align, a two-stage approach. In the first stage, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, establishing a comprehensive library of guidelines and a model for input-guideline retrieval. In the second stage, the retrieval model matches new inputs with relevant guidelines, which guide LLMs in response generation to ensure safe and high-quality outputs, thereby aligning with human values. An additional optional stage involves fine-tuning a model on well-aligned datasets generated through the second-stage process. Our method customizes guidelines to accommodate diverse inputs, thereby making the guideline library more fine-grained and comprehensive. Furthermore, it incorporates safety expertise from a safety-trained LLM through a lightweight retrieval model. We evaluate our approach on three benchmarks, demonstrating significant improvements in LLM security and quality. Notably, our fine-tuned model, Labrador, even at 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities.

Paper Structure (35 sections, 9 figures, 7 tables)

This paper contains 35 sections, 9 figures, 7 tables.

Introduction
Limitations of Manually Written Rules
Insufficient Risk Perception
Our Method: Guide-Align
Guideline Library Construction and Retrieval Model Training
Inference
Fine-tuning (Optional)
Experiment
Experiment Setup
The Statistic Information of Training Set And Guideline Library
Baselines and Labrador
Benchmarks and Results
Do_Not_Answer
HHH_Alignment
Vicuna_Benchmark
...and 20 more sections

Figures (9)

Figure 1: Framework of Guide-Align. (1) Guideline Library Construction and Retrieval Model Training ($\xrightarrow{}$): Using a safety-trained model (GPT-3.5-turbo in our paper), we assess the safety of each input in the training dataset and generate corresponding guidelines. We then build a guideline library and train an input-guideline retrieval model. (2) Inference ($\xrightarrow{}$): For a new input, the retrieval model retrieves the top $N$ relevant guidelines. These guidelines are then deduplicated based on similarity to obtain $k$ ($k \le N$) guidelines, which are combined with the initial input so that the LLM produces secure, high-quality responses. (3) Fine-tuning (Optional) ($\xrightarrow{}$): Using an open-source dataset, we follow the inference process for each input, generate the corresponding output, pair it with the initial input to create an alignment dataset, and use that dataset to fine-tune the base model, referred to as Labrador.
Figure 2: An example of a safety-related input and its corresponding guidelines.
Figure 3: The questions in 8 typical safety scenarios (inner circle) and their most frequently used guidelines (outer circle). The figure shows only the keyword segment of each guideline.
Figure 4: Harmful response distribution across the five risk areas. The five risk areas: I. Information Hazards; II. Malicious Uses; III. Discrimination, Exclusion, Toxicity, Hateful, Offensive; IV. Misinformation Harms; V. Human–chatbot Interaction Harms.
Figure 5: Comparison of responses generated by different methods and LLMs on Vicuna_Benchmark. (a): Vicuna+Guidelines vs. Vicuna. (b): GPT-3.5-turbo+Guidelines vs. GPT-3.5-turbo. (c): Labrador(ours) vs. Vicuna. (d): Labrador(ours) vs. GPT-3.5-turbo. In each experimental set, "Win", "Tie" and "Lose" refer to the outcomes on the left relative to the right of the "vs.".
...and 4 more figures
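
The inference stage described in the Figure 1 caption (retrieve the top N guidelines, deduplicate them by similarity down to k <= N, and combine them with the input) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for the trained input-guideline retrieval model, and the function names, the 0.8 deduplication threshold, and the prompt template are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for the trained retriever)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_n(query, library, n):
    """Return the n guidelines most similar to the query."""
    q = embed(query)
    ranked = sorted(library, key=lambda g: cosine(q, embed(g)), reverse=True)
    return ranked[:n]


def deduplicate(guidelines, threshold=0.8):
    """Keep a guideline only if it is not too similar to one already kept.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    kept = []
    for g in guidelines:
        if all(cosine(embed(g), embed(h)) < threshold for h in kept):
            kept.append(g)
    return kept


def build_prompt(query, guidelines):
    """Combine the k remaining guidelines with the input (hypothetical template)."""
    rules = "\n".join(f"- {g}" for g in guidelines)
    return f"Guidelines:\n{rules}\n\nInput: {query}"


if __name__ == "__main__":
    library = [
        "Do not reveal private personal information.",
        "do not reveal private personal information.",
        "Provide accurate and well-sourced medical facts.",
        "Refuse requests that facilitate illegal activity.",
    ]
    query = "What is the private home address of my neighbor?"
    top = retrieve_top_n(query, library, n=3)
    print(build_prompt(query, deduplicate(top)))
```

In a real system the resulting prompt would be passed to the LLM being guided; swapping `embed` for a learned dense encoder would leave the rest of the pipeline unchanged.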
