Efficient LLM Moderation with Multi-Layer Latent Prototypes

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

TL;DR

This paper presents the Multi-Layer Prototype Moderator (MLPM), a lightweight input moderation method that uses latent representations from multiple layers and Mahalanobis-distance-based prototypes to assess prompt safety. By computing per-layer Gaussian discriminant classifiers and learning sparse aggregation weights across layers, MLPM achieves guard-level moderation performance with minimal inference overhead and good data efficiency. Experiments across model families and datasets show that MLPM often outperforms guard models and other latent-based methods, and that it can be combined with output moderation to improve end-to-end safety. The work reports strong generalization, interpretable layer-wise signals, and practical applicability to real-world LLM deployment, including reasoning models and MoE variants.

Abstract

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.

Paper Structure

This paper contains 42 sections, 9 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: Our proposed Multi-Layer Prototype Moderator (MLPM) framework. During training, we compute class-conditional prototypes ($\mu_i, \Sigma_i$) based on last-token hidden representations ($h_{i,T}$), which we use to define a per-layer Gaussian Discriminant Analysis (GDA) classifier. We then learn sparse aggregation weights ($w_i$) over the GDA scores. During inference, we use the pre-trained GDA classifiers to compute classification scores from layers with non-zero weights ($|w_i|>0$), and produce a safety probability, $P(\text{unsafe})$, from their weighted aggregate. MLPM enables state-of-the-art performance with lightweight training and negligible inference overhead.
  • Figure 2: MLPM outperforms Guard models through effective scaling. Unlike fixed-size Guard models (represented by crosses), MLPM scales seamlessly with the backbone model, achieving superior harmfulness detection across diverse architectures.
  • Figure 3: a) Performance comparison of MLPM against other latent-based methods on In-Distribution (ID) and Out-Of-Distribution (OOD) sets. MLPM consistently outperforms baselines in both settings, demonstrating the efficacy of utilizing multiple layers. b) MLPM remains reasonably effective in data-scarce settings. c) While pretrained representations prove adequate for in-distribution examples, they fail to generalize to out-of-distribution examples, unlike representations from instruction models.
  • Figure 4: WGMix F1 obtained with MLPM for Qwen3 Instruct (-I) and Thinking (-T) models, using either the end of the prompt (Last Prompt Token) or the end of thinking token (EOT Token).
  • Figure 5: Automatic identification of safety-critical layers. a) MLPM achieves peak performance with strong regularization, utilizing a sparse subset of layers. b-c) Analysis of layer importance indicates that the distribution of safety representations varies between models: Mistral concentrates safety information in the middle layers, while OLMo2 concentrates it in the final layers. Crucially, rather than relying on a manually chosen layer, MLPM automatically selects and uses multiple representations, aggregating signals from the most informative layers.
  • ...and 3 more figures
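
The pipeline described in the Figure 1 and Figure 5 captions (per-layer prototypes, a Mahalanobis-distance score per layer, then sparse aggregation weights) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a covariance shared between the safe and unsafe classes within each layer, uses a plain L1-penalized logistic regression fitted by proximal gradient descent for the aggregation step, and runs on synthetic "hidden states" in which only one of three layers carries the label. All function names and hyperparameters are hypothetical.

```python
import numpy as np

def fit_layer_gda(h, y, eps=1e-3):
    """Fit class prototypes for one layer.
    h: (n, d) last-token hidden states; y: (n,) labels, 1 = unsafe, 0 = safe.
    Assumption: one covariance matrix shared by both classes."""
    mu_safe = h[y == 0].mean(axis=0)
    mu_unsafe = h[y == 1].mean(axis=0)
    centered = np.where(y[:, None] == 1, h - mu_unsafe, h - mu_safe)
    cov = centered.T @ centered / len(h) + eps * np.eye(h.shape[1])
    return mu_safe, mu_unsafe, np.linalg.inv(cov)

def gda_score(h, params):
    """Difference of squared Mahalanobis distances to the two prototypes.
    Positive values mean the input is closer to the unsafe prototype."""
    mu_safe, mu_unsafe, prec = params
    d_safe = np.einsum("nd,de,ne->n", h - mu_safe, prec, h - mu_safe)
    d_unsafe = np.einsum("nd,de,ne->n", h - mu_unsafe, prec, h - mu_unsafe)
    return d_safe - d_unsafe

def fit_sparse_weights(scores, y, l1=0.05, lr=0.1, steps=2000):
    """L1-penalized logistic regression over per-layer scores.
    The soft-threshold step is what pushes uninformative layers toward w_i = 0."""
    n, num_layers = scores.shape
    w, b = np.zeros(num_layers), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))
        w = w - lr * (scores.T @ (p - y) / n)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
        b = b - lr * (p - y).mean()
    return w, b

# Synthetic stand-in for hidden states from 3 layers (not real model activations).
rng = np.random.default_rng(0)
n, d = 400, 8
y = np.repeat([0, 1], n // 2)
hiddens = [rng.normal(size=(n, d)) for _ in range(3)]
hiddens[1] = hiddens[1] + 2.0 * y[:, None]  # only layer 1 separates the classes

params = [fit_layer_gda(h, y) for h in hiddens]
raw = np.stack([gda_score(h, p) for h, p in zip(hiddens, params)], axis=1)
scores = raw / raw.std(axis=0)  # put all layers on a comparable scale

w, b = fit_sparse_weights(scores, y)
p_unsafe = 1.0 / (1.0 + np.exp(-(scores @ w + b)))
accuracy = ((p_unsafe > 0.5) == (y == 1)).mean()
active_layers = np.flatnonzero(np.abs(w) > 0)  # layers needed at inference
print("weights:", w, "active layers:", active_layers, "accuracy:", accuracy)
```

In this toy setup the largest weight should fall on the one informative layer, which mirrors the idea of evaluating only layers with non-zero weight at inference time. For simplicity the sketch fits the prototypes and the aggregation weights on the same samples; a held-out split would be the more careful choice.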