Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF

Chen Zheng; Ke Sun; Hang Wu; Chenguang Xi; Xun Zhou

TL;DR

The conversational abilities of Mistral-Plus were significantly improved, indicating a substantial advancement over traditional SFT models in both safety and user preference alignment.

Abstract

Recent work on conversational Large Language Models (LLMs) has revealed a concerning trend: many new base LLMs lose some of their foundational capabilities after Supervised Fine-Tuning (SFT). This process often leads to forgetting or a decline in the base model's abilities. Moreover, fine-tuned models struggle to align with user preferences and can produce more toxic outputs when specifically prompted. To overcome these challenges, we bypass SFT entirely and directly apply Harmless Reinforcement Learning from Human Feedback (RLHF). Our method preserves the base model's general capabilities while significantly enhancing its conversational abilities and notably reducing the generation of toxic outputs. The approach has significant implications for fields that demand nuanced understanding and generation of responses, such as customer service. We applied this methodology to Mistral, one of the most popular base models, to create Mistral-Plus. Validation across 11 general tasks shows that Mistral-Plus outperforms similarly sized open-source base models and their corresponding instruct versions. Importantly, the conversational abilities of Mistral-Plus improved significantly, indicating a substantial advance over traditional SFT models in both safety and user preference alignment.

Paper Structure (21 sections, 4 equations, 5 figures, 6 tables)

Introduction
Related Works
Safety Issues in LLM
Reinforcement Learning from Human Feedback
Model Description
Helpful and Harmless Reward Model
Mistral-Plus: Direct RLHF in Conversational LLM
Actor Model Learning
Important Training Trick: Optimizing RLHF for Concise Response Generation
Experiments
Experimental Setup
General Task Evaluation
Results
Analysis
Mistral-Plus on General Language Understanding and Reasoning
...and 6 more sections

Figures (5)

Figure 1: Comparison of our proposed Mistral-Plus with various LLMs on machine moderation tasks. A blue box represents the LLM base model, a green box the Supervised Fine-Tuning (SFT) model, and an orange box the Reinforcement Learning from Human Feedback (RLHF) model. The model outputs are evaluated across three distinct categories: General Ability, Answer Correctness, and Safety. Note that both the Mistral RLHF model and our Mistral-Plus model use the same Helpfulness & Harmlessness dataset.
Figure 2: Score comparisons between different LLMs.
Figure 3: Comparative Case Study in the MT-Bench Multi-Turn Task.
Figure 4: Bad-word generation probability for Mistral-Instruct and Mistral-Plus. The x-axis represents different intermediate layers; the y-axis shows token probability.
Figure 5: Bad-word generation probability for Mistral-Instruct and Mistral-Plus. The x-axis represents 5 bad words; the y-axis shows the probability of each bad word being output.
