LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models

Lukas Helff; Felix Friedrich; Manuel Brack; Kristian Kersting; Patrick Schramowski

TL;DR

LlavaGuard proposes an open, VLM-based framework to safeguard vision datasets and models by developing a flexible safety taxonomy, a multimodal safety dataset with guided rationales, and a scalable model suite (0.5B–7B). It introduces a policy-responsive prompt-response setup and evaluates robustness to policy variations with metrics like PER and PES. Empirical results show LlavaGuard outperforms SOTA safeguards and open baselines in accuracy and policy adaptability, while practical demonstrations on dataset auditing and model moderation highlight real-world impact. The work contributes an end-to-end, openly available framework for vision safety that can adapt to diverse regulatory settings and policies, though it acknowledges limitations related to annotation signals and policy scope.

Abstract

This paper introduces LlavaGuard, a suite of VLM-based vision safeguards that address the critical need for reliable guardrails in the era of large-scale data and models. To this end, we establish a novel open framework, describing a customizable safety taxonomy, data preprocessing, augmentation, and training setup. To teach a VLM safeguard about safety, we further create a multimodal safety dataset with high-quality human expert annotations, where each image is labeled with a safety rating, category, and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting LlavaGuard models, ranging from 0.5B to 7B parameters, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, LlavaGuard outperforms both state-of-the-art safeguards and VLMs in accuracy and in flexibly handling different policies. Additionally, we demonstrate LlavaGuard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework publicly available, including the dataset, model weights, and training code.
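
As an illustration of the three-part assessment described above (safety rating, category, rationale), the following is a minimal sketch of how a downstream tool might parse and validate one assessment. It assumes the assessment arrives as a JSON object; the field names `rating`, `category`, and `rationale`, the rating values `Safe`/`Unsafe`, and the helper `parse_assessment` are hypothetical and are not taken from the LlavaGuard release.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyAssessment:
    """One image assessment. Field names are illustrative assumptions."""
    rating: str      # assumed to be "Safe" or "Unsafe"
    category: str    # a category from whatever taxonomy the policy defines
    rationale: str   # free-text justification for the rating


def parse_assessment(raw: str) -> SafetyAssessment:
    """Parse a JSON-encoded assessment and check that all three parts exist."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("assessment must be a JSON object")
    missing = {"rating", "category", "rationale"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Normalise casing so "unsafe" and "Unsafe" are treated alike.
    rating = str(data["rating"]).strip().capitalize()
    if rating not in {"Safe", "Unsafe"}:
        raise ValueError(f"unexpected rating: {data['rating']!r}")
    return SafetyAssessment(rating, str(data["category"]), str(data["rationale"]))
```

Validating the structure before use means a malformed or truncated model response is rejected rather than silently counted as safe, which matters when auditing a dataset at scale.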

Paper Structure (39 sections, 2 equations, 13 figures, 6 tables)

Introduction
Background
Safety Audits.
Generative AI Risk Assessment and Mitigation.
LlavaGuard's Safety Taxonomy for Vision
Safety Categories
Risk Guidelines
Dataset Creation
Data Collection
Data Augmentation
Guided Rationales
Dataset Construction
LlavaGuard Model Suite
Prompt-Response Setup
Policy Responsiveness
...and 24 more sections

Figures (13)

Figure 1: LlavaGuard judges images for safety compliance to a policy, providing a safety rating, category, and rationale.
Figure 2: LlavaGuard provides safety reviews, including category, rationale, and rating. On the left, it assesses an SMID image from the test set under two policies. LlavaGuard demonstrates strong policy-following abilities by adapting to policy changes. The right shows evaluations for SMID (Crone et al., 2018), X.com, and COCO (Lin et al., 2014) images.
Figure 3: Category-wise analysis of safety performance. LlavaGuard shows consistent coverage of safety categories whereas other models exhibit either overall or category-specific limitations.
Figure 4: Dataset Audit. LlavaGuard applied to ImageNet (1.3M images). In summary, LlavaGuard successfully detects candidate images and categorizes them as un/safe according to its taxonomy. (a) reports quantitative results encompassing overall category detections as well as the portion classified as unsafe. The results are also split by category. (b) illustrates examples of images classified as unsafe, with the safety class shown in red and the ImageNet class shown in blue.
Figure 5: Safeguarding generative models. LlavaGuard applied to I2P (11k images generated with StableDiffusion-v1.5). In summary, LlavaGuard successfully detects synthetic candidate images and categorizes them as un/safe according to its taxonomy. (a) reports quantitative results encompassing overall category detections as well as the portion classified as unsafe. The results are also split by category. LlavaGuard performs well in the safety assessment of synthetic content. (b) illustrates examples of images classified as unsafe, with the safety category shown in red.
...and 8 more figures
