SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

Youngjin Na, Sangheon Jeong, Youngwan Lee, Jian Lee, Dawoon Jeong, Youngman Kim

TL;DR

SIA addresses safety challenges in vision-language models where harmful intent can emerge from the interaction of an image and accompanying text. It deploys a training-free, three-stage pipeline: (i) captioning to convert visual content into text, (ii) few-shot chain-of-thought prompting to infer latent intent and generate reasoning, and (iii) intent-conditioned response generation to produce safe outputs. The approach demonstrates strong safety gains across SIUO, HoliSafe, and MM-SafetyBench without model fine-tuning, including notable improvements such as Gemma3-IT-4B's safety on SIUO rising from 28.14% to 62.28%. By leveraging explicit intent reasoning with pretrained components, SIA offers a practical, scalable solution for safer multimodal interaction in real-world deployments.

Abstract

With the growing deployment of Vision-Language Models (VLMs) in real-world applications, previously overlooked safety risks are becoming increasingly evident. In particular, seemingly innocuous multimodal inputs can combine to reveal harmful intent, leading to unsafe model outputs. While multimodal safety has received increasing attention, existing approaches often fail to address such latent risks, especially when harmfulness arises only from the interaction between modalities. We propose SIA (Safety via Intent Awareness), a training-free, intent-aware safety framework that proactively detects harmful intent in multimodal inputs and uses it to guide the generation of safe responses. SIA follows a three-stage process: (1) visual abstraction via captioning; (2) intent inference through few-shot chain-of-thought (CoT) prompting; and (3) intent-conditioned response generation. By dynamically adapting to the implicit intent inferred from an image-text pair, SIA mitigates harmful outputs without extensive retraining. Extensive experiments on safety benchmarks, including SIUO, MM-SafetyBench, and HoliSafe, show that SIA consistently improves safety and outperforms prior training-free methods.
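The three stages described above can be sketched as a thin orchestration layer around pretrained models. This is a minimal illustration, not the authors' implementation. The names `run_sia`, `build_intent_prompt`, `build_response_prompt`, and `SIAResult`, the prompt wording, and the choice to pass the image again in stage 3 are all assumptions made for this example. The models are injected as plain callables, so no particular library API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces; the paper does not prescribe these signatures.
Captioner = Callable[[bytes], str]           # image bytes -> caption text
TextModel = Callable[[str], str]             # prompt -> completion
VisionModel = Callable[[bytes, str], str]    # (image bytes, prompt) -> response
FewShot = Tuple[str, str, str]               # (caption, question, reasoning with intent)


@dataclass
class SIAResult:
    caption: str
    intent: str
    response: str


def build_intent_prompt(caption: str, query: str, examples: Sequence[FewShot]) -> str:
    """Assemble a few-shot chain-of-thought prompt (wording is illustrative only)."""
    parts = ["Reason step by step about the intent behind the image and question."]
    for ex_caption, ex_query, ex_reasoning in examples:
        parts.append(
            f"Caption: {ex_caption}\nQuestion: {ex_query}\nReasoning: {ex_reasoning}"
        )
    # The final block is left open so the model continues with its own reasoning.
    parts.append(f"Caption: {caption}\nQuestion: {query}\nReasoning:")
    return "\n\n".join(parts)


def build_response_prompt(query: str, intent: str) -> str:
    """Condition the final answer on the inferred intent (wording is illustrative only)."""
    return (
        f"Inferred intent: {intent}\n"
        f"Question: {query}\n"
        "Answer helpfully, and stay safe given the inferred intent."
    )


def run_sia(
    image: bytes,
    query: str,
    captioner: Captioner,
    intent_model: TextModel,
    responder: VisionModel,
    examples: Sequence[FewShot] = (),
) -> SIAResult:
    # Stage 1: visual abstraction via captioning.
    caption = captioner(image)
    # Stage 2: intent inference through few-shot chain-of-thought prompting.
    intent = intent_model(build_intent_prompt(caption, query, examples))
    # Stage 3: intent-conditioned response generation (assumed to also see the image).
    response = responder(image, build_response_prompt(query, intent))
    return SIAResult(caption=caption, intent=intent, response=response)
```

Because every stage only calls an existing model, nothing in this sketch requires gradient updates, which is what makes such a pipeline training-free. Swapping the captioner or the few-shot examples changes behavior without touching model weights.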

Paper Structure

This paper contains 17 sections, 4 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: An intent-aware response example on the SIUO benchmark illustrates the effectiveness of our approach (SIA) by comparing outputs of the Mistral-Small3.2 model [mistral2023technical]. When conditioned solely on the image and query without intent, the vision-language model (VLM) produces an unsafe baseline response. In contrast, our proposed SIA framework, which integrates few-shot intent inference, generates a safe and contextually appropriate response.
  • Figure 2: Overall architecture of our proposed Safety via Intent Awareness framework (SIA). The framework consists of three sequential stages: (1) Visual abstraction via captioning, (2) Intent inference using few-shot prompting, and (3) Safe response generation conditioned on the inferred intent.
  • Figure 3: Category-wise safety rates on SIUO benchmark. SIA is compared against other methods across categories.
  • Figure 4: Prompt used to infer subtle or harmful intent in multimodal questions.
  • Figure 5: Prompt used to guide response generation.
  • ...and 5 more figures