Lightweight Safety Classification Using Pruned Language Models
Mason Sawtell, Tula Masterman, Sandi Besen, Jim Brown
TL;DR
The paper tackles content safety and prompt injection classification for LLMs by introducing Layer Enhanced Classification (LEC), which trains a Penalized Logistic Regression (PLR) classifier on the hidden state of the best-performing intermediate transformer layer. The PLR classifier has roughly as many trainable parameters as the hidden state has dimensions (as few as 769), which makes for a lightweight classifier that, trained on fewer than 100 labeled examples, often outperforms GPT-4o and task-specific baselines. The method generalizes across general-purpose and special-purpose models, covers both binary and multi-class content safety and prompt injection tasks, and can be deployed either inside the LLM inference path or as an independent feature extractor. The findings suggest that robust feature extraction is an inherent property of many transformer architectures, and that small LLMs pruned to an intermediate layer can serve as standalone feature extractors for fast, scalable safety classification.
Abstract
In this paper, we introduce a novel technique for content safety and prompt injection classification for Large Language Models. Our technique, Layer Enhanced Classification (LEC), trains a Penalized Logistic Regression (PLR) classifier on the hidden state of an LLM's optimal intermediate transformer layer. By combining the computational efficiency of a streamlined PLR classifier with the sophisticated language understanding of an LLM, our approach delivers superior performance, surpassing GPT-4o and special-purpose models fine-tuned for each task. We find that small general-purpose models (Qwen 2.5 sizes 0.5B, 1.5B, and 3B) and other transformer-based architectures like DeBERTa v3 are robust feature extractors, allowing simple classifiers to be effectively trained on fewer than 100 high-quality examples. Importantly, the intermediate transformer layers of these models typically outperform the final layer across both classification tasks. Our results indicate that a single general-purpose LLM can be used to classify content safety, detect prompt injections, and generate output tokens at the same time. Alternatively, these relatively small LLMs can be pruned to the optimal intermediate layer and used exclusively as robust feature extractors. Since our results are consistent across different transformer architectures, we infer that robust feature extraction is an inherent capability of most, if not all, LLMs.
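To make the LEC recipe concrete, the following is a minimal sketch of its two steps: fit one penalized logistic regression per layer on that layer's hidden states, then keep the layer whose classifier scores best on held-out data. It is not the authors' implementation. The hidden states are synthetic stand-ins in which an intermediate layer is made most separable by construction, so the printed accuracies illustrate the procedure and are not results. The L2 penalty, the gradient-descent optimizer, the toy hidden size of 64, and every function name (`fit_plr`, `fake_hidden_states`, `accuracy`) are assumptions of this sketch. The classifier has one weight per hidden dimension plus a bias; for a 768-wide hidden state that would be 769 parameters, which may be how the figure in the TL;DR arises, though that reading is also an assumption.

```python
import numpy as np


def fit_plr(X, y, l2=1.0, lr=0.1, steps=500):
    """Fit binary logistic regression with an L2 penalty by gradient descent.

    Returns (w, b). The L2 penalty and the optimizer are assumptions of this
    sketch; the source only says the regression is "penalized".
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)      # clip so exp() stays finite
        p = 1.0 / (1.0 + np.exp(-z))             # predicted P(label == 1)
        w -= lr * (X.T @ (p - y) + l2 * w) / n   # penalized weight gradient
        b -= lr * float(np.mean(p - y))          # bias is left unpenalized
    return w, b


def accuracy(w, b, X, y):
    """Fraction of rows whose predicted label matches y."""
    return float(np.mean((X @ w + b > 0) == (y == 1)))


def fake_hidden_states(n, layer_signal, hidden_size, seed=0):
    """Hypothetical stand-in for per-layer hidden states of n prompts.

    layer_signal[k] sets how far apart the two classes sit at layer k.
    Returns an array of shape (num_layers, n, hidden_size) and labels y.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    sign = 2.0 * y[:, None] - 1.0                # map {0, 1} to {-1, +1}
    layers = []
    for s in layer_signal:
        direction = rng.normal(size=hidden_size)
        direction /= np.linalg.norm(direction)
        noise = rng.normal(size=(n, hidden_size))
        layers.append(noise + s * sign * direction)
    return np.stack(layers), y


# Layer index 2 is the most separable BY CONSTRUCTION (synthetic data).
LAYER_SIGNAL = [0.3, 1.0, 4.0, 1.5]
H, y = fake_hidden_states(n=160, layer_signal=LAYER_SIGNAL, hidden_size=64)
train, val = slice(0, 80), slice(80, 160)        # fewer than 100 training rows

val_acc = []
for layer_feats in H:                            # one classifier per layer
    w, b = fit_plr(layer_feats[train], y[train])
    val_acc.append(accuracy(w, b, layer_feats[val], y[val]))

best_layer = int(np.argmax(val_acc))             # layer to keep (and prune to)
n_params = H.shape[-1] + 1                       # weights plus one bias

print("validation accuracy per layer:", [round(a, 3) for a in val_acc])
print("selected layer:", best_layer, "| classifier parameters:", n_params)
```

With a real model, each slice of `H` would hold that layer's hidden states for the labeled prompts (for example, obtained by requesting `output_hidden_states=True` from a Hugging Face Transformers model), and pruning would mean dropping every layer after `best_layer`. How the per-token states are pooled into a single vector per prompt is not stated in this section, so that step is left out of the sketch.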
