Efficient LLM Moderation with Multi-Layer Latent Prototypes

Maciej Chrabąszcz; Filip Szatkowski; Bartosz Wójcik; Jan Dubiński; Tomasz Trzciński; Sebastian Cygert

TL;DR

The Multi-Layer Prototype Moderator (MLPM) is a lightweight input moderation method that uses latent representations from multiple layers and Mahalanobis-distance-based prototypes to assess prompt safety. It fits a Gaussian discriminant classifier at each layer and learns sparse weights to aggregate the per-layer scores, which yields moderation performance on par with dedicated guard models at minimal inference overhead and with good data efficiency. Experiments across model families and datasets show that MLPM often outperforms guard models and other latent-space methods, and that it can be combined with output moderation to improve end-to-end safety. The work also reports strong generalization, interpretable layer-wise signals, and applicability to real-world LLM deployment, including reasoning models and MoE variants.

Abstract

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.
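
To make the idea of per-layer prototypes concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that hidden states have already been extracted into an array of shape (samples, layers, dim), that each layer uses two class means (the prototypes) with a shared covariance, that the per-layer score is the difference of squared Mahalanobis distances, and that the sparse aggregation can be approximated by L1-regularized logistic regression. All function names and hyperparameters are hypothetical.

```python
import numpy as np

def fit_layer_prototypes(feats, labels, eps=1e-3):
    """Fit one Gaussian discriminant per layer: two class means and a shared covariance.

    feats: float array (n_samples, n_layers, dim)
    labels: int array (n_samples,), 0 = safe, 1 = unsafe
    """
    params = []
    for l in range(feats.shape[1]):
        x = feats[:, l, :]
        mu = np.stack([x[labels == c].mean(axis=0) for c in (0, 1)])  # prototypes
        centered = x - mu[labels]
        # Shared within-class covariance, regularized so that it is invertible.
        cov = centered.T @ centered / len(x) + eps * np.eye(x.shape[1])
        params.append((mu, np.linalg.inv(cov)))
    return params

def layer_scores(feats, params):
    """Per-layer score: squared Mahalanobis distance to the safe prototype
    minus the distance to the unsafe one (positive means closer to unsafe)."""
    scores = np.empty(feats.shape[:2])
    for l, (mu, prec) in enumerate(params):
        diff = feats[:, l, None, :] - mu[None]               # (n, 2, dim)
        d2 = np.einsum("ncd,de,nce->nc", diff, prec, diff)   # (n, 2)
        scores[:, l] = d2[:, 0] - d2[:, 1]
    return scores

def fit_sparse_weights(scores, labels, l1=0.01, lr=0.1, steps=500):
    """Assumed aggregation: L1-regularized logistic regression over layer scores,
    trained with proximal gradient descent (soft thresholding)."""
    scale = scores.std(axis=0) + 1e-8
    z = scores / scale
    w, b = np.zeros(z.shape[1]), 0.0
    for _ in range(steps):
        logits = np.clip(z @ w + b, -30.0, 30.0)
        grad = 1.0 / (1.0 + np.exp(-logits)) - labels
        w -= lr * (z.T @ grad) / len(z)
        b -= lr * grad.mean()
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w / scale, b  # weights expressed in the original score units

def predict_unsafe(feats, params, w, b):
    """Flag a prompt as unsafe when the weighted sum of layer scores is positive."""
    return layer_scores(feats, params) @ w + b > 0

if __name__ == "__main__":
    # Synthetic stand-in for hidden states: only layers 2 and 3 carry signal.
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=200)
    feats = rng.normal(size=(200, 4, 8))
    feats[:, 2:, 0] += 4.0 * labels[:, None]
    params = fit_layer_prototypes(feats, labels)
    w, b = fit_sparse_weights(layer_scores(feats, params), labels)
    print("layer weights:", np.round(w, 4))
    print("train accuracy:", (predict_unsafe(feats, params, w, b) == labels).mean())
```

In this sketch the only inference-time cost beyond the model's own forward pass is one small quadratic form per selected layer, which is consistent with the low overhead described above. A layer whose weight shrinks to zero can be skipped entirely.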
