UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models

Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie

TL;DR

UpSafe$^\circ$C addresses safety concerns in large language models by introducing a dynamic, modular approach that couples training-time upcycling with inference-time control. It identifies safety-critical layers via a Safety Sensitivity Score, upcycles them into a sparse Mixture-of-Experts with a Soft Guardrail router, and applies a two-stage SFT to sharpen safety discrimination while preserving general capabilities. A Safety Temperature $\tau\in[0,1]$ at inference yields a controllable Pareto frontier between safety and utility, demonstrated across multiple base models and scales. The results show robust safety improvements against harmful and jailbreak inputs with minimal degradation on broad knowledge and reasoning tasks, illustrating a practical path toward controllable, modular LLM safety.

Abstract

Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base models, and model scales demonstrate that UpSafe$^\circ$C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
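
To make the idea of a temperature-controlled soft guardrail concrete, the following is a minimal sketch of how a two-way router could be steered by a safety temperature $\tau\in[0,1]$. The exact routing rule of UpSafe$^\circ$C is not given in this summary, so the interpolation used here (a prior of $\tau$ on the safety expert and $1-\tau$ on the original MLP, multiplied into the softmax weights) is an assumption chosen only for illustration. The names `safety_prob` and `mix_outputs` are hypothetical.

```python
import math


def safety_prob(general_logit: float, safety_logit: float, tau: float) -> float:
    """Probability mass routed to the safety expert.

    ASSUMPTION: this rule is illustrative, not the paper's formula.
    It is a two-way softmax whose weights are scaled by (1 - tau) for the
    general expert and tau for the safety expert, so that:
      tau = 0   -> only the original MLP is used,
      tau = 0.5 -> the plain, unbiased softmax of the router logits,
      tau = 1   -> only the safety expert is used.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if tau == 0.0:
        return 0.0
    if tau == 1.0:
        return 1.0
    # Log-odds of choosing the safety expert, shifted by the tau prior.
    d = (safety_logit - general_logit) + math.log(tau) - math.log1p(-tau)
    # Numerically stable sigmoid.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def mix_outputs(general_out: list[float], safety_out: list[float], p_safety: float) -> list[float]:
    """Convex combination of the two expert outputs."""
    return [(1.0 - p_safety) * g + p_safety * s for g, s in zip(general_out, safety_out)]


if __name__ == "__main__":
    # Hypothetical router logits for one token.
    for tau in (0.0, 0.25, 0.5, 0.75, 1.0):
        p = safety_prob(general_logit=1.0, safety_logit=0.0, tau=tau)
        print(f"tau={tau:.2f}  p_safety={p:.3f}")
```

Under this assumed rule, raising $\tau$ monotonically shifts routing mass toward the safety expert without retraining, which is the kind of inference-time control the abstract describes.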

Paper Structure

This paper contains 23 sections, 17 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Overall framework of UpSafe$^\circ$C. We first scan the pretrained LLM to identify safety-critical layers, then upcycle them with safety experts through a two-stage SFT strategy, and finally apply a safety temperature at inference to dynamically balance safety and utility.
  • Figure 2: (a) t-SNE visualization comparing a safety-critical layer with another layer in Llama3.1-8B-Instruct. The safety-critical layer displays more discriminative representations between harmful and benign inputs, supporting our safety-critical layer scan strategy. (b) Scan results of Llama3.1-8B-Instruct. We plot the SS-Score across layers and highlight the top-3 safety-critical layers.
  • Figure 3: Top: theoretical activation probabilities of general and safety experts under varying safety temperatures. Bottom: actual expert scores observed during inference, illustrating how the routing behaves in practice.
  • Figure 4: Safety–utility trade-off curves under different safety temperatures $\tau$, with points color-coded by temperature and the Pareto frontier highlighted.
  • Figure 5: Routing distribution of harmful vs. benign prompts.
  • ...and 7 more figures
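
The Pareto frontier mentioned in the Figure 4 caption can be computed from a set of (safety, utility) points with a simple sort-and-sweep. The sketch below assumes both metrics are "higher is better". The sample points are made-up numbers for illustration only, not results from the paper, and the function name `pareto_frontier` is hypothetical.

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the non-dominated (safety, utility) points.

    A point is dominated if another point is at least as good on both
    axes and strictly better on at least one. Both axes are assumed to
    be "higher is better".
    """
    # Sort by safety (descending), breaking ties by utility (descending).
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    frontier = []
    best_utility = float("-inf")
    for safety, utility in ordered:
        # Every earlier point has safety >= this one, so this point
        # survives only if it strictly improves on utility.
        if utility > best_utility:
            frontier.append((safety, utility))
            best_utility = utility
    return frontier


if __name__ == "__main__":
    # Illustrative, made-up (safety, utility) scores, e.g. one per tau setting.
    scores = [(0.9, 0.5), (0.8, 0.7), (0.7, 0.6), (0.5, 0.9), (0.9, 0.4)]
    print(pareto_frontier(scores))
```

Sorting once and sweeping keeps the cost at O(n log n), which is more than enough for the handful of temperature settings a trade-off plot typically contains.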