MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability

Yanrui Du; Sendong Zhao; Danyang Zhao; Ming Ma; Yuhan Chen; Liangyu Huo; Qing Yang; Dongliang Xu; Bing Qin

TL;DR

The paper addresses the safety-usability trade-off in open-source LLMs by introducing MoGU, a framework that uses LoRA to derive two variants from a base model: Glad$_{resp}$, tuned for usability, and Unwill$_{resp}$, tuned for safety. A dedicated router then dynamically weights the two. The responders are trained on constructed pairs of benign and malicious instructions, and the router is optimized with a routing objective so that it assigns more weight to Unwill$_{resp}$ for malicious prompts and more weight to Glad$_{resp}$ for benign prompts. Empirical results across multiple open-source LLMs show that MoGU improves safety under red-team and jailbreak evaluations while preserving, or even enhancing, usability metrics compared to seven baseline defenses. The work demonstrates that dynamic, instruction-aware routing can balance safety and usability, offering a scalable approach to safer open-source LLM deployments.

Abstract

Large Language Models (LLMs) are increasingly deployed in various applications. As their usage grows, concerns regarding their safety are rising, especially in maintaining harmless responses when faced with malicious instructions. Many defense strategies have been developed to enhance the safety of LLMs. However, our research finds that existing defense strategies lead LLMs to predominantly adopt a rejection-oriented stance, thereby diminishing the usability of their responses to benign instructions. To solve this problem, we introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability. Our MoGU framework transforms the base LLM into two variants, the usable LLM and the safe LLM, and further employs dynamic routing to balance their contributions. When encountering malicious instructions, the router assigns a higher weight to the safe LLM to ensure that responses are harmless. Conversely, for benign instructions, the router prioritizes the usable LLM, facilitating usable and helpful responses. On various open-sourced LLMs, we compare multiple defense strategies to verify the superiority of our MoGU framework. In addition, our analysis provides key insights into the effectiveness of MoGU and verifies that our routing mechanism can effectively balance the contribution of each variant by assigning weights. We release safer versions of Llama2, Vicuna, Falcon, Dolphin, and Baichuan2.
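
The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: this summary does not give the router architecture or its training objective, so the linear scorers, the sigmoid gating, and the names `route_weights` and `fuse_outputs` are all assumptions made for illustration. Only the idea of weighting the two variants' output vectors (`o_states`) based on an input vector (`h_states`) comes from the source.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def route_weights(h_states, router_glad, router_unwill):
    """Map an input vector to (w_glad, w_unwill).

    Assumption: each router is a single linear scorer followed by a
    sigmoid. The paper's router may be structured differently.
    """
    score_glad = sum(h * w for h, w in zip(h_states, router_glad))
    score_unwill = sum(h * w for h, w in zip(h_states, router_unwill))
    return sigmoid(score_glad), sigmoid(score_unwill)


def fuse_outputs(o_glad, o_unwill, w_glad, w_unwill):
    """Weighted sum of the usable and safe variants' output vectors."""
    return [w_glad * g + w_unwill * u for g, u in zip(o_glad, o_unwill)]


# Toy example with hypothetical 2-dimensional vectors.
h_states = [1.0, -1.0]        # input vector for one token
router_glad = [2.0, 0.0]      # hypothetical scorer favouring the usable variant
router_unwill = [-2.0, 0.0]   # hypothetical scorer favouring the safe variant

w_glad, w_unwill = route_weights(h_states, router_glad, router_unwill)
fused = fuse_outputs([1.0, 0.0], [0.0, 1.0], w_glad, w_unwill)
print(w_glad > w_unwill, fused)
```

In this toy input the usable variant receives the larger weight, which mirrors the behavior the abstract describes for benign instructions; a malicious instruction would be expected to shift the weight toward the safe variant.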

Paper Structure (41 sections, 11 equations, 5 figures, 12 tables)

This paper contains 41 sections, 11 equations, 5 figures, 12 tables.

Introduction
Related Work
Attack strategies
Red-team evaluation.
Jailbreak attack.
Defense Strategies
MoGU Framework
Training Data Preparation
Training Stage
The training of glad and unwilling responders.
The design and training of router.
Inference Stage
Main Experiments
Preliminary
LLMs.
...and 26 more sections

Figures (5)

Figure 1: An example to illustrate how the router assigns weights to Glad$_{resp}$ and Unwill$_{resp}$. The h_states and o_states represent the input vector and output vector respectively.
Figure 2: Overall framework of our MoGU.
Figure 3: The distribution of weights assigned by the router of Vicuna$_{7B}$.
Figure 4: The distribution of weights assigned by the router of Llama2$_{7B}$ and Falcon$_{7B}$.
Figure 5: Results (ASR%) of LLMs under red-team evaluation and various jailbreak attacks, with d$_{router}$ set to 128, 256, 512, and 1024. "AVG." indicates the average defense performance. Lower ASR% values indicate better defense performance.
