OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities

Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh

TL;DR

OmniGuard detects harmful prompts across languages and modalities by exploiting internal representations of LLMs/MLLMs that are shared across them. It introduces the Universality Score (U-Score) to locate language- and modality-agnostic layers, trains a lightweight classifier on the embeddings from those layers, and reuses generation-time representations to avoid the overhead of a separate guard model. On multilingual text, image, and audio benchmarks, OmniGuard improves accuracy over the strongest baseline by 11.57% (multilingual) and 20.44% (image), sets a new state of the art for audio, and runs roughly 120× faster than the next fastest baseline. It also adapts quickly from a few examples. Limitations include dependence on open models for access to embeddings and possible degradation on unseen models or domains.

Abstract

The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OmniGuard, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OmniGuard improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OmniGuard is also very efficient (≈120× faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
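
The two steps in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not define the U-Score, so `alignment_score` below uses one plausible stand-in (mean cosine similarity of matched pairs minus that of mismatched pairs), the embeddings are synthetic, and the "lightweight classifier" is a plain logistic-regression probe. The function names are hypothetical and are not taken from the OmniGuard repository.

```python
import numpy as np

def alignment_score(a, b):
    """Assumed stand-in for a universality score (not the paper's definition).

    a, b: arrays of shape (n, d). Row i of `a` and row i of `b` embed the same
    content in two languages (or two modalities).
    Returns mean cosine similarity of matched pairs minus that of mismatched pairs.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a @ b.T                      # (n, n) matrix of cosine similarities
    n = sim.shape[0]
    matched = np.trace(sim) / n
    mismatched = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    return matched - mismatched

def select_layer(layers_a, layers_b):
    """Step (i): pick the layer whose paired embeddings are best aligned."""
    scores = [alignment_score(a, b) for a, b in zip(layers_a, layers_b)]
    return int(np.argmax(scores)), scores

def train_probe(x, y, lr=0.1, steps=500):
    """Step (ii): fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted P(harmful)
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(x, w, b):
    """Label 1 (harmful) where the probe's logit is positive."""
    return (x @ w + b > 0).astype(int)

# Synthetic stand-ins for per-layer embeddings of paired prompts.
rng = np.random.default_rng(0)
n, d = 64, 16
shared = rng.normal(size=(n, d))
layers_a = [rng.normal(size=(n, d)), shared]                       # "language A"
layers_b = [rng.normal(size=(n, d)),
            shared + 0.1 * rng.normal(size=(n, d))]                # "language B"
labels = (shared[:, 0] > 0).astype(float)                          # toy harmfulness labels

best, scores = select_layer(layers_a, layers_b)
w, b = train_probe(layers_a[best], labels)       # train on language A only
acc = (predict(layers_b[best], w, b) == labels).mean()  # evaluate on language B
print(f"selected layer: {best}, cross-language accuracy: {acc:.2f}")
```

In this toy setup the probe is trained on one "language" and evaluated on the other. That transfer only works because the selected layer places paired inputs close together, which is the property the layer-selection step is meant to find.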

Paper Structure

This paper contains 44 sections, 1 equation, 4 figures, 10 tables.

Figures (4)

  • Figure 1: OmniGuard builds a harmfulness classifier that operates on internal representations of an LLM (or MLLM). OmniGuard uses a custom metric (U-Score) to identify representations that generalize across languages and modalities. At inference time, OmniGuard re-uses the embeddings from the LLM/MLLM being used for generation, and thereby completely avoids the overhead of passing the inputs through a separate guard model for safety moderation.
  • Figure 2: The U-Score across different layers for different modalities. (A) The U-Score at different layers of Llama3.3-70B-Instruct for different languages. (B) The Cross-Model Alignment Score at different layers of the model (Molmo-7B) for similarity between images and captions. The highest values are obtained at layers 21-25, indicating better alignment between images and their text captions at these layers. (C) The Cross-Model Alignment Score at different layers of the model (Llama-Omni 8B) for similarity between audio clips and transcriptions. The highest values are obtained at layers 20-23, indicating better alignment between audio clips and their text transcriptions at these layers.
  • Figure 3: Accuracy of detecting harmful prompts in a few-shot setting. As few-shot examples are provided, OmniGuard quickly achieves near-perfect accuracy, even though the attacks are quite different from its training data (e.g., without any few-shot examples, OmniGuard's accuracy is close to 50%). In contrast, the guard model baselines improve their accuracy slowly in a few-shot setting, despite sometimes having seen similar code attacks in their training data. Accuracies are averaged over 50 random sets of few-shot examples; error bars show the standard error of the mean.
  • Figure 4: Accuracy of classifying sentiment in various languages compared with the accuracy of detecting harmful prompts in those languages using OmniGuard. In both cases the LLM is Llama3.3-70B-Instruct.
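
The Figure 3 caption reports accuracies averaged over 50 random sets of few-shot examples, with error bars showing the standard error of the mean. A minimal sketch of that aggregation follows. It assumes the sample standard deviation (`ddof=1`) is used, which the caption does not specify, and the `runs` values are synthetic placeholders rather than results from the paper.

```python
import numpy as np

def mean_and_sem(accuracies):
    """Return the mean and the standard error of the mean of a set of runs."""
    acc = np.asarray(accuracies, dtype=float)
    mean = acc.mean()
    # SEM = sample standard deviation / sqrt(number of runs); ddof=1 is an assumption.
    sem = acc.std(ddof=1) / np.sqrt(acc.size)
    return mean, sem

# Synthetic placeholder accuracies for 50 random few-shot sets (not real results).
rng = np.random.default_rng(0)
runs = rng.uniform(0.9, 1.0, size=50)
mean, sem = mean_and_sem(runs)
print(f"accuracy = {mean:.3f} +/- {sem:.3f} (SEM over {runs.size} runs)")
```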