Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection, Element, and Location in One-shot

Sheng Hang, Chaoxiang He, Hongsheng Hu, Hanqing Hu, Bin Benjamin Zhu, Shi-Feng Sun, Dawu Gu, Shuo Wang

TL;DR

This paper addresses the need for fine-grained malicious-image moderation by moving beyond image-level NSFW flags to identifying the specific objects, and their locations, that drive a policy violation. The authors propose a zero-shot pipeline that couples Segment Anything Model (SAM)–based segmentation with open-vocabulary vision-language model (VLM) scoring, aggregating evidence across multiple segmenters to produce pixel-accurate masks and a toxicity heatmap. Key contributions include a mask-merging strategy with a principled scoring function, a VLM-based region scoring mechanism, and an ensemble defense against segmentation-targeted attacks, all validated on a newly curated NSFW-Malicious dataset. The reported results show strong element-level recall and segment accuracy, robustness to adaptive attacks, and practical run-time, suggesting the pipeline is ready for integration into real moderation workflows.

Abstract

Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects whether an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks, all in one pass. The system first applies a foundation segmentation model (SAM) to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model (VLM) using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly annotated 790-image dataset spanning drug, sexual, violent and extremist content, our method attains 85.8% element-level recall, 78.1% precision and a 92.1% segment-success rate, exceeding direct zero-shot VLM localization by 27.4% recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and the VLM, our method's precision and recall decrease by no more than 10%, demonstrating high robustness against attacks. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
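
The score-weighted fusion and multi-segmenter ensemble described in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual scoring function or fusion rule, so everything here is an assumption made for illustration: masks are sets of (row, col) pixels, each mask carries a VLM relevance score in [0, 1], overlapping masks are averaged, and the per-segmenter maps are averaged again. The names `fuse_masks`, `ensemble_heatmap`, and `threshold` are hypothetical.

```python
from statistics import mean


def fuse_masks(masks, scores, height, width):
    """Build one heatmap from a single segmenter's masks.

    Assumption: each mask is a set of (row, col) pixels and each score is
    the VLM's malicious-relevance score for that region. A pixel covered by
    several masks gets the average of their scores; uncovered pixels get 0.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for mask, score in zip(masks, scores):
        for r, c in mask:
            total[r][c] += score
            count[r][c] += 1
    return [
        [total[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(width)]
        for r in range(height)
    ]


def ensemble_heatmap(per_segmenter, height, width):
    """Average the heatmaps from several segmenters.

    per_segmenter is a list of (masks, scores) pairs, one per segmenter.
    Averaging is a stand-in for the paper's ensemble rule: a region that
    only one (possibly attacked) segmenter reports is down-weighted.
    """
    maps = [fuse_masks(m, s, height, width) for m, s in per_segmenter]
    return [
        [mean(hm[r][c] for hm in maps) for c in range(width)]
        for r in range(height)
    ]


def threshold(heatmap, tau=0.5):
    """Binarize the heatmap into a malicious-object map (tau is illustrative)."""
    return [[value >= tau for value in row] for row in heatmap]
```

In this sketch, a pixel flagged with a high score by every segmenter stays above the threshold, while a pixel flagged by only one segmenter is halved by the ensemble average. That is one simple way to see why an attack on a single segmentation method has limited effect on the fused map.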

Paper Structure

This paper contains 39 sections, 12 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overall architecture of our pipeline
  • Figure 2: Explainability Methods on Qwen2.5VL-7B (All Transformer Layers Overlay)
  • Figure 3: Drug image comparison of our method with direct query (Qwen2.5VL-7B)
  • Figure 4: Gory image comparison of our method with direct query (Qwen2.5VL-7B)
  • Figure 5: Porn image comparison of our method with direct query (Qwen2.5VL-7B)
  • ...and 5 more figures