MOSAIC: Composable Safety Alignment with Modular Control Tokens

Jingyu Peng, Hongyu Chen, Jiancheng Dong, Maolin Wang, Wenxi Li, Yuchen Li, Kai Zhang, Xiangyu Zhao

Abstract

Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.

Paper Structure

This paper contains 27 sections, 9 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: An illustration comparing two existing safety alignment paradigms and the proposed MOSAIC framework at inference time.
  • Figure 2: Overview of the proposed MOSAIC approach, showing the sampling strategy in the upper panel and the training objective in the lower panel.
  • Figure 3: Performance on Llama-3.1-8B under different numbers of control tokens per category.
  • Figure 4: Ablation Study. N-token uses only the control tokens of N subset categories during inference.
  • Figure 5: Performance on Llama-3.1-8B under different negative-to-positive ratios.
  • ...and 2 more figures