Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually

Mazal Bethany, Brandon Wherry, Nishant Vishwamitra, Peyman Najafirad

TL;DR

The paper tackles safe content moderation for images on social platforms, which requires both accurate, grounded rationales for obfuscation and minimal obfuscation of unsafe regions. It introduces ConditionalVLM, a vision language model conditioned on pre-trained unsafe image classifiers, to generate rationales grounded in domain-specific attributes. It also presents Counterfactual Subobject Explanations (CSE), which combine adaptive Bayesian segmentation, FullGrad attributions, and a greedy search to identify the smallest set of regions whose obfuscation alters the classifier output. Empirical results across sexually explicit, cyberbullying, and self-harm data demonstrate improved description quality and effective minimal obfuscation, offering practical benefits for moderators and investigators while preserving user safety.

Abstract

Social media platforms are increasingly used by malicious actors to share unsafe content, such as images depicting sexual activity, cyberbullying, and self-harm. Consequently, major platforms use artificial intelligence (AI) and human moderation to obfuscate such images to make them safer. Two critical needs in obfuscating unsafe images are that an accurate rationale for obfuscating image regions must be provided, and that the sensitive regions should be obfuscated (e.g., by blurring) for users' safety. This process involves addressing two key problems: (1) the platform must provide an accurate rationale for obfuscation that is grounded in unsafe image-specific attributes, and (2) the unsafe regions in the image must be minimally obfuscated while still depicting the safe regions. In this work, we address these issues in two steps. First, we perform visual reasoning by designing a vision language model (VLM) conditioned on pre-trained unsafe image classifiers to provide an accurate rationale grounded in unsafe image attributes. Second, we propose a counterfactual explanation algorithm that identifies and minimally obfuscates unsafe regions for safe viewing: it uses an unsafe image classifier attribution matrix to guide a better subregion segmentation, then runs an informed greedy search, based on attribution scores, to determine the minimum number of subregions required to modify the classifier's output. Extensive experiments on uncurated data from social networks demonstrate the efficacy of our proposed method. We make our code available at: https://github.com/SecureAIAutonomyLab/ConditionalVLM
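
The attribution-guided greedy search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes a segmentation map and a per-pixel attribution matrix are already available, ranks segments by mean attribution (an assumption), and replaces blurring with a constant fill. The function name `greedy_counterfactual_mask` and the `predict_unsafe` callable are hypothetical stand-ins for the unsafe image classifier. A greedy pass of this kind yields a small set of regions but does not guarantee the true minimum.

```python
import numpy as np


def greedy_counterfactual_mask(image, segments, attribution, predict_unsafe,
                               threshold=0.5, fill=0.0):
    """Obfuscate high-attribution segments one by one until the classifier flips.

    image:          float array of shape (H, W) or (H, W, C)
    segments:       int array of shape (H, W) holding a segment label per pixel
    attribution:    float array of shape (H, W) with per-pixel attribution scores
    predict_unsafe: hypothetical callable returning P(unsafe) for an image
    """
    labels = np.unique(segments)
    # Score each segment by its mean attribution (assumed aggregation rule).
    scores = {int(l): float(attribution[segments == l].mean()) for l in labels}
    ranked = sorted(scores, key=scores.get, reverse=True)

    masked = image.copy()
    chosen = []
    for label in ranked:
        # Stop as soon as the classifier no longer considers the image unsafe.
        if predict_unsafe(masked) < threshold:
            break
        # Constant fill stands in for blurring the selected region.
        masked[segments == label] = fill
        chosen.append(label)

    flipped = predict_unsafe(masked) < threshold
    return masked, chosen, flipped
```

The function returns the obfuscated image, the ordered list of obfuscated segment labels, and a flag indicating whether the classifier output changed; segments that were never selected remain untouched, which corresponds to keeping the safe regions visible.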

Paper Structure (14 sections, 11 equations, 2 figures, 4 tables)

This paper contains 14 sections, 11 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the proposed architecture. The initial module utilizes ConditionalVLM for classifying images as safe or unsafe, while the subsequent module proposes counterfactual visual explanations to identify and obfuscate the unsafe regions within the image.
  • Figure 2: Examples of segmentation methods on a cyberbullying image. From top to bottom: (1) BASS, (2) SLIC, (3) SAM.