Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations

Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel, Zechun Liu, Changsheng Zhao, Yangyang Shi, Tijmen Blankevoort, Mahesh Pasupuleti, Bilge Soran, Zacharie Delpierre Coudert, Rachad Alao, Raghuraman Krishnamoorthi, Vikas Chandra

TL;DR

The paper tackles the challenge of deploying robust safety guards for human-AI conversations on resource-constrained devices. It introduces Llama Guard 3-1B-INT4, a compact guard model derived from Llama Guard that is aggressively pruned and quantized, enabling on-device inference with a throughput of at least 30 tokens/s and a time-to-first-token of at most 2.5 s on a commodity Android mobile CPU while retaining competitive safety performance. The authors present a multi-step compression pipeline that yields a final model of roughly 0.4 GB: pruning decoder blocks and MLP width, quantization-aware training to 4-bit weights and 8-bit activations, unembedding-layer pruning, and distillation from a larger guard model. Empirical results on multilingual safety benchmarks show that the INT4 model matches or surpasses the larger Llama Guard 3-1B in many languages and outperforms GPT-4 on several metrics, highlighting the practicality of efficient, on-device safety moderation for mobile applications. Limitations include reliance on pretraining data and potential vulnerability to adversarial prompts. Overall, the work demonstrates a viable, scalable path to safe, on-device guard systems for conversational AI.
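To make the "4-bit weights" step in the summary above more concrete, the following is a minimal sketch of symmetric 4-bit weight quantization applied to a single row of weights. It is an illustration only, not the paper's implementation: the per-row symmetric scheme, the function names `quantize_int4` and `dequantize`, and the example values are all assumptions chosen for clarity.

```python
# Illustrative sketch (assumed scheme, not the authors' code):
# symmetric 4-bit quantization of one row of weights.

def quantize_int4(weights):
    """Map floats to signed 4-bit integer codes in [-8, 7] with one shared scale."""
    qmax = 7  # largest positive value representable in signed 4 bits
    max_abs = max(abs(w) for w in weights)
    # One scale per row; fall back to 1.0 for an all-zero row to avoid dividing by zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


row = [0.7, -0.3, 0.12, 0.0]      # hypothetical weight values
codes, scale = quantize_int4(row)
approx = dequantize(codes, scale)
print(codes)                       # integer codes, each storable in 4 bits
print(approx)                      # reconstructed weights, close to the originals
```

Each weight is stored as a 4-bit code plus a shared scale, which is where most of the size reduction relative to 16-bit weights comes from. In quantization-aware training, a rounding step of this kind is simulated during training so that the model learns to tolerate the rounding error.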

Abstract

This paper presents Llama Guard 3-1B-INT4, a compact and efficient Llama Guard model, which was open-sourced to the community during Meta Connect 2024. We demonstrate that Llama Guard 3-1B-INT4 can be deployed on resource-constrained devices, achieving a throughput of at least 30 tokens per second and a time-to-first-token of 2.5 seconds or less on a commodity Android mobile CPU. Notably, our experiments show that Llama Guard 3-1B-INT4 attains comparable or superior safety moderation scores to its larger counterpart, Llama Guard 3-1B, despite being approximately 7 times smaller, at 440MB.

Paper Structure

This paper contains 13 sections, 5 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Llama Guard output classification example.
  • Figure 2: Exporting and lowering quantized model for ExecuTorch runtime.
  • Figure 3: Visualization of the compression pipeline.