Hyperbolic Safety-Aware Vision-Language Models

Tobia Poppi, Tejaswi Kasarla, Pascal Mettes, Lorenzo Baraldi, Rita Cucchiara

TL;DR

HySAC reframes unsafe content in vision-language models as a hierarchical, safety-aware problem by embedding safe and unsafe concepts in hyperbolic space using entailment cones. It combines hyperbolic contrastive learning with safety entailment to position safe content closer to the origin and unsafe content farther away, enabling dynamic traversals that redirect unsafe queries toward safe alternatives. The approach yields improved safety-aware retrieval, robustness on NSFW datasets, and a usable NSFW classifier by-product, while preserving performance on safe content. This work advances practical content moderation for VLMs by offering interpretable geometry-driven control over multimodal outputs and retrievals, with potential for integration into downstream generation systems.

Abstract

Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model's knowledge of unsafe concepts. While effective in reducing unwanted outputs, unlearning limits the model's capacity to discern between safe and unsafe content. In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. This modelling, ineffective in standard vision-language models due to their reliance on Euclidean embeddings, endows the model with awareness of unsafe content, enabling it to serve as both a multimodal unsafe classifier and a flexible content retriever, with the option to dynamically redirect unsafe queries toward safer alternatives or retain the original output. Extensive experiments show that our approach not only enhances safety recognition but also establishes a more adaptable and interpretable framework for content moderation in vision-language models. Our source code is available at https://github.com/aimagelab/HySAC.

Paper Structure

This paper contains 29 sections, 19 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Overview of our approach. HySAC builds a hyperbolic embedding that manages content safety through an entailment hierarchy. Unsafe text and images are projected to dedicated regions of hyperbolic space, allowing for safety-aware retrieval and classification.
  • Figure 2: Distributions of embedding distances from the root. We embed all ViSU training samples and visualize the distribution of their distances from the root. CLIP and Safe-CLIP do not separate texts from images, whereas MERU does. HySAC additionally separates safe from unsafe content.
  • Figure 3: Qualitative traversal results. HySAC traverses towards the root feature, retrieving the top-1 text at each interpolation point. This traversal effectively transitions from unsafe to safe captions, demonstrating the model's ability to ensure safety-aware content retrieval.
  • Figure 4: Distributions of embedding distances from the root. Comparison of the distance distributions of Euclidean and hyperbolic embeddings from the root. The Euclidean version of HySAC does not separate safe from unsafe content, while HySAC does.
  • Figure 5: Traversals from unsafe image queries towards safe captions. We present qualitative results of HySAC, showing the traversals from unsafe image queries toward the root feature. Interpolation points along this path are used as new queries to retrieve captions from a pool of both safe and unsafe texts.
  • ...and 2 more figures
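
The traversal described in the captions of Figures 3 and 5 can be illustrated with a small sketch. It is not the authors' implementation: it assumes the root coincides with the origin of the Lorentz model, that interpolation is done by shrinking the query's tangent vector at the origin, and that retrieval picks the candidate with the smallest Lorentzian distance. The names `traverse_to_root`, `log_map0`, `exp_map0`, and the curvature parameter `c` are hypothetical.

```python
import numpy as np

def _time(x_space, c):
    # Time coordinate implied by the hyperboloid constraint (curvature -c).
    return np.sqrt(1.0 / c + np.sum(x_space ** 2, axis=-1))

def lorentz_distance(x_space, y_space, c=1.0):
    # Geodesic distance on the hyperboloid, from space components only.
    inner = np.sum(x_space * y_space, axis=-1) - _time(x_space, c) * _time(y_space, c)
    return np.arccosh(np.clip(-c * inner, 1.0, None)) / np.sqrt(c)

def exp_map0(v, c=1.0):
    # Tangent vector at the origin -> space components on the hyperboloid.
    s = np.sqrt(c) * np.linalg.norm(v, axis=-1, keepdims=True)
    safe = np.where(s > 0, s, 1.0)
    return np.where(s > 0, np.sinh(safe) / safe, 1.0) * v

def log_map0(x_space, c=1.0):
    # Inverse of exp_map0: space components -> tangent vector at the origin.
    s = np.sqrt(c) * np.linalg.norm(x_space, axis=-1, keepdims=True)
    safe = np.where(s > 0, s, 1.0)
    return np.where(s > 0, np.arcsinh(safe) / safe, 1.0) * x_space

def traverse_to_root(query_space, candidates_space, steps=5, c=1.0):
    # ASSUMPTION: root == origin; interpolation shrinks the tangent vector.
    v = log_map0(query_space, c)
    picks = []
    for t in np.linspace(0.0, 1.0, steps):
        point = exp_map0((1.0 - t) * v, c)
        d = lorentz_distance(point[None, :], candidates_space, c)
        picks.append(int(np.argmin(d)))  # top-1 candidate at this step
    return picks

# Toy data: three candidates on one ray, at increasing distance from the origin.
candidates = exp_map0(np.array([[0.2, 0.0], [1.0, 0.0], [2.0, 0.0]]))
query = exp_map0(np.array([2.0, 0.0]))
picks = traverse_to_root(query, candidates, steps=3)
```

In this toy setup the retrieved index moves from the farthest candidate to the one nearest the origin as the interpolation proceeds, mirroring the unsafe-to-safe transition the captions describe.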

Theorems & Definitions (2)

  • Definition 3.1: Lorentzian distance
  • Definition 3.2: Exponential map
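
The two definitions are only named in this listing. As a rough guide, the sketch below uses the standard Lorentz-model formulas for geodesic distance and for the exponential map at the origin. The paper's exact notation may differ, and the function names and the curvature parameter `c` are assumptions made for illustration.

```python
import numpy as np

def time_component(x_space, c=1.0):
    # On the hyperboloid with curvature -c, the time coordinate is fixed
    # by the space coordinates: x_time = sqrt(1/c + ||x_space||^2).
    return np.sqrt(1.0 / c + np.sum(x_space ** 2, axis=-1))

def lorentz_inner(x_space, y_space, c=1.0):
    # Lorentzian inner product: <x, y>_L = <x_space, y_space> - x_time * y_time.
    return (np.sum(x_space * y_space, axis=-1)
            - time_component(x_space, c) * time_component(y_space, c))

def lorentz_distance(x_space, y_space, c=1.0):
    # d(x, y) = (1 / sqrt(c)) * arccosh(-c * <x, y>_L).
    # The clip guards against arguments slightly below 1 from rounding.
    arg = np.clip(-c * lorentz_inner(x_space, y_space, c), 1.0, None)
    return np.arccosh(arg) / np.sqrt(c)

def exp_map0(v, c=1.0):
    # Exponential map at the origin: maps a tangent (Euclidean) vector v
    # to the space components of a point on the hyperboloid.
    s = np.sqrt(c) * np.linalg.norm(v, axis=-1, keepdims=True)
    safe = np.where(s > 0, s, 1.0)  # avoid 0/0 for the zero vector
    return np.where(s > 0, np.sinh(safe) / safe, 1.0) * v

# The exponential map at the origin preserves length: a tangent vector of
# norm 0.5 lands at geodesic distance 0.5 from the origin.
v = np.array([0.3, -0.4])
x = exp_map0(v, c=2.0)
d_root = lorentz_distance(np.zeros(2), x, c=2.0)
```

Under these formulas, the "distance from the root" plotted in Figures 2 and 4 would correspond to `lorentz_distance` evaluated against the origin, assuming the root is placed there.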