Lightweight Safety Classification Using Pruned Language Models

Mason Sawtell, Tula Masterman, Sandi Besen, Jim Brown

TL;DR

The paper addresses content safety and prompt injection classification for LLMs with Layer Enhanced Classification (LEC), which trains a Penalized Logistic Regression (PLR) classifier on the hidden state of an optimally chosen intermediate transformer layer. The PLR classifier's parameter count scales with the hidden-state size and can be as low as 769, and the approach often outperforms GPT-4o and task-specific baselines with fewer than 100 labeled examples. The method generalizes across general-purpose and special-purpose models, and can be deployed either integrated into the LLM inference path or as a standalone feature extractor, for both binary and multi-class content safety tasks and for prompt injection detection. The findings suggest that robust feature extraction is an inherent property of many transformer architectures and that small LLMs pruned to an intermediate layer can serve as feature extractors for fast, scalable safety classification.

Abstract

In this paper, we introduce a novel technique for content safety and prompt injection classification for Large Language Models. Our technique, Layer Enhanced Classification (LEC), trains a Penalized Logistic Regression (PLR) classifier on the hidden state of an LLM's optimal intermediate transformer layer. By combining the computational efficiency of a streamlined PLR classifier with the sophisticated language understanding of an LLM, our approach delivers superior performance, surpassing GPT-4o and special-purpose models fine-tuned for each task. We find that small general-purpose models (Qwen 2.5 sizes 0.5B, 1.5B, and 3B) and other transformer-based architectures like DeBERTa v3 are robust feature extractors, allowing simple classifiers to be effectively trained on fewer than 100 high-quality examples. Importantly, the intermediate transformer layers of these models typically outperform the final layer across both classification tasks. Our results indicate that a single general-purpose LLM can be used to classify content safety, detect prompt injections, and simultaneously generate output tokens. Alternatively, these relatively small LLMs can be pruned to the optimal intermediate layer and used exclusively as robust feature extractors. Since our results are consistent across different transformer architectures, we infer that robust feature extraction is an inherent capability of most, if not all, LLMs.
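
To make the classifier side of LEC concrete, here is a minimal sketch of an L2-penalized logistic regression trained on fixed-size feature vectors. It is not the authors' implementation. The feature vectors are synthetic stand-ins for hidden states taken from an intermediate layer, and the function names (`train_plr`, `synthetic_hidden_states`), the hidden size of 16, and every hyperparameter are assumptions chosen for illustration. Extracting real hidden states and selecting the best layer are not shown.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_plr(features, labels, l2=0.01, lr=0.5, epochs=200):
    """Fit an L2-penalized logistic regression by batch gradient descent.

    Hyperparameters are illustrative assumptions, not values from the paper.
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(score) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        for j in range(dim):
            # The penalty term shrinks the weights; the bias is not penalized.
            weights[j] -= lr * (grad_w[j] / n + l2 * weights[j])
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


def synthetic_hidden_states(n, dim, rng):
    # Hypothetical stand-in for hidden-state features: the first four
    # dimensions carry the class signal, the rest are pure noise.
    feats, labels = [], []
    for i in range(n):
        y = i % 2
        shift = 1.0 if y else -1.0
        feats.append([rng.gauss(shift if j < 4 else 0.0, 1.0) for j in range(dim)])
        labels.append(y)
    return feats, labels


rng = random.Random(0)
HIDDEN_SIZE = 16  # assumed toy size; real hidden states are much wider
train_x, train_y = synthetic_hidden_states(80, HIDDEN_SIZE, rng)  # fewer than 100 examples
test_x, test_y = synthetic_hidden_states(200, HIDDEN_SIZE, rng)

weights, bias = train_plr(train_x, train_y)
num_params = len(weights) + 1  # one weight per feature plus the bias
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print("trainable parameters:", num_params)
print("held-out accuracy:", round(accuracy, 3))
```

In this sketch the binary classifier has one weight per feature plus a bias term. For a 768-dimensional hidden state that would give 768 + 1 = 769 parameters, which matches the figure quoted in the TL;DR if that count includes a bias.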

Paper Structure

This paper contains 20 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Visualization of a hybrid black-box model.
  • Figure 2: LEC performance of select layers on binary content safety classification for Qwen 2.5 0.5B Instruct, Llama Guard 3 1B, and Llama Guard 3 8B.
  • Figure 3: LEC performance of Qwen 2.5 0.5B Instruct on all three levels of the multi-class content safety dataset.
  • Figure 4: Performance of select layers on prompt injection classification for both general-purpose Qwen 2.5 0.5B Instruct and DeBERTa-v3-Prompt-Injection-v2.
  • Figure 5: LEC performance at each layer of the DeBERTa-v3-Prompt-Injection-v2 model for the prompt injection task.
  • ...and 8 more figures