SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng

TL;DR

SaFeR-ToolKit formalizes safety decision-making as a checkable protocol and trains a single policy to follow it with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), where GRPO directly supervises tool usage beyond answer-level feedback.

Abstract

Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception $\to$ Reasoning $\to$ Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 $\to$ 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 $\to$ 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 $\to$ 59.21; 7B: 66.39 $\to$ 66.81). Code is available at https://github.com/Duebassx/SaFeR_ToolKit.
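
To make the idea of a "checkable protocol" concrete, the following is a minimal sketch of how a typed key-value tool trace could be validated against a constrained transition graph. It is an illustration only, not the authors' implementation: the tool names, the `STAGE_OF` and `ALLOWED_NEXT` tables, the trace layout (`tool` plus `output`), and the `check_trace` function are all hypothetical and assumed for this example.

```python
from typing import Dict, List, Set

# Hypothetical tools, each assigned to one of the three stages
# (Perception -> Reasoning -> Decision) named in the abstract.
STAGE_OF: Dict[str, str] = {
    "describe_image": "perception",
    "infer_intent": "reasoning",
    "assess_risk": "reasoning",
    "decide_response": "decision",
}

# Hypothetical constrained transition graph: tool -> tools allowed next.
ALLOWED_NEXT: Dict[str, Set[str]] = {
    "describe_image": {"infer_intent"},
    "infer_intent": {"assess_risk"},
    "assess_risk": {"decide_response", "describe_image"},
    "decide_response": set(),
}


def check_trace(trace: List[dict]) -> List[str]:
    """Return the protocol violations found in a trace (empty list = pass)."""
    if not trace:
        return ["empty trace"]
    errors: List[str] = []

    # Each step must name a known tool and carry a key-value output.
    for i, step in enumerate(trace):
        tool = step.get("tool")
        if tool not in STAGE_OF:
            errors.append(f"step {i}: unknown tool {tool!r}")
        out = step.get("output")
        if not isinstance(out, dict) or not all(isinstance(k, str) for k in out):
            errors.append(f"step {i}: output must be a key-value mapping")

    tools = [step.get("tool") for step in trace]

    # The trace must begin with perception and end with a decision.
    if STAGE_OF.get(tools[0]) != "perception":
        errors.append("trace must start with a perception tool")
    if STAGE_OF.get(tools[-1]) != "decision":
        errors.append("trace must end with a decision tool")

    # Every consecutive pair must be an edge of the transition graph.
    for i, (prev, nxt) in enumerate(zip(tools, tools[1:])):
        if prev in ALLOWED_NEXT and nxt not in ALLOWED_NEXT[prev]:
            errors.append(f"step {i} -> {i + 1}: transition {prev} -> {nxt} not allowed")
    return errors


good_trace = [
    {"tool": "describe_image", "output": {"objects": "kitchen knife on a table"}},
    {"tool": "infer_intent", "output": {"intent": "cooking question"}},
    {"tool": "assess_risk", "output": {"risk": "low"}},
    {"tool": "decide_response", "output": {"action": "answer"}},
]
# Skips the reasoning stage, so the graph check should flag it.
bad_trace = [good_trace[0], good_trace[3]]

print(check_trace(good_trace))
print(check_trace(bad_trace))
```

Because the check is purely structural, a trace can be accepted or rejected without judging the final answer. This is the sense in which tool usage can, in principle, be supervised separately from answer-level feedback.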

Paper Structure (80 sections, 22 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Overview of SaFeR-ToolKit. (a) Motivating Example: Unlike baseline refusals, SaFeR-ToolKit uses dynamic ToolKit thinking (Perception $\rightarrow$ Reasoning $\rightarrow$ Decision) to generate safe and educational responses. (b) Data Statistics: Dataset distribution across training stages (SFT, DPO, GRPO) and tool categories. (c) Evaluation: Ablation studies demonstrating stage-wise improvements (left) and performance comparisons with SOTA baselines (right).
  • Figure 2: SaFeR-ToolKit overview and training pipeline. Given an image-question input, a planner selects a persona, a virtual tool subset, and a topology (linear/tree/mesh/shield/loop); the responder then produces a structured tool trace and final answer. SFT learns the trace format and basic tool usage, DPO improves tool selection and execution, and GRPO refines deeper tool-based reasoning.
  • Figure 2: General capability evaluation. We report accuracy across five benchmarks. The best and second-best results are highlighted.
  • Figure 3: Reward ablation on Qwen2.5-VL-3B. $\square$: Base (DPO); $\heartsuit$: +basic ($R_{\mathrm{fmt}}$+$R_{\mathrm{sem}}$ w/o $s_{\mathrm{tool}}$); $\diamondsuit$: +basic+depth ($R_{\mathrm{dep}}$); $\spadesuit$: +basic+quality ($s_{\mathrm{tool}}$); $\clubsuit$: +basic+both.
  • Figure 4: Qualitative safety comparison. Left: Bullying scenario; Right: Deception scenario. Qwen2.5-VL-7B provides harmful instructions, while SaFeR-ToolKit employs tool-mediated reasoning to detect risks, refuse safely, and pivot to constructive guidance.
  • ...and 2 more figures