JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, Shu-Tao Xia

TL;DR

JPRO introduces a novel four-agent, multi-phase framework for automated, black-box jailbreaking of vision-language models, addressing the limitations of prior white-box and pattern-based attacks. By coupling a Planner, Attacker, Modifier, and Verifier with two core modules—Tactic-Driven Seed Generation and Adaptive Optimization Loop—JPRO achieves high attack diversity and scalability, maintaining malicious intent across multi-turn interactions. Empirical results show JPRO significantly outperforms baselines in ASR across diverse proprietary and open-source VLMs and demonstrates notable transferability; diversity metrics confirm broad coverage of attack directions. The work provides practical insights for VLM safety evaluation and highlights the need for dynamic, hybrid-defense approaches to counter adaptive multimodal attacks. It also offers a formal upper-bound perspective on attack capabilities and discusses defense limitations against tactic-hybrid strategies, underscoring the importance of robust red-teaming in real-world deployments.

Abstract

The widespread application of large VLMs makes ensuring their secure deployment critical. While recent studies have demonstrated jailbreak attacks on VLMs, existing approaches are limited: they either require white-box access, which restricts practicality, or rely on manually crafted patterns, which leads to poor sample diversity and scalability. To address these gaps, we propose JPRO, a novel multi-agent collaborative framework designed for automated VLM jailbreaking. It effectively overcomes the shortcomings of prior methods in attack diversity and scalability. Through the coordinated action of four specialized agents and its two core modules, Tactic-Driven Seed Generation and Adaptive Optimization Loop, JPRO generates effective and diverse attack samples. Experimental results show that JPRO achieves an attack success rate of over 60% on multiple advanced VLMs, including GPT-4o, significantly outperforming existing methods. As a black-box attack approach, JPRO not only uncovers critical security vulnerabilities in multimodal models but also offers valuable insights for evaluating and enhancing VLM robustness.
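The abstract reports results as an attack success rate (ASR) but does not restate how it is computed. The sketch below assumes the common convention that ASR is the fraction of evaluated queries judged to be successful attacks. The function name `attack_success_rate` and the example judgments are hypothetical, chosen for illustration; they are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of evaluated queries judged as successful attacks.

    Assumption: each entry is one query's final judgment (True = success).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    # Booleans sum as 0/1, so this counts the successes.
    return sum(outcomes) / len(outcomes)


# Hypothetical judgments for four queries (illustrative, not data from the paper).
example = [True, False, True, True]
print(f"ASR = {attack_success_rate(example):.0%}")  # prints "ASR = 75%"
```

Under this convention, a reported ASR above 60% means that more than six in ten evaluated queries ended in a response judged successful. How success is judged is a separate choice that this sketch leaves open.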

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Illustration of different attacks. JPRO requires no white-box permissions and achieves higher diversity.
  • Figure 2: Overview of our proposed JPRO framework. It consists of two phases: (a) Phase 1: Tactic-Driven Seed Generation Phase, in which the planner extracts and combines strategies from a predefined tactic library to form multiple attack directions. (b) Phase 2: Adaptive Optimization Loop. According to the attack directions generated by the planner, the attacker is responsible for generating specific image prompts and texts, and invoking the diffusion model for image generation. The modifier verifies the generated image-text pairs based on the directions. Finally, the verifier guides the attacker's next-round attack from both topic and risk perspectives until the attack succeeds.
  • Figure 4: Diversity analysis of JPRO. Compared with baselines, JPRO requires the minimum number of attack attempts to obtain 5 unique attacks, and the first 5 attack samples show significant differences.
  • Figure : (a) ASR w/o Modifier
  • Figure : (a) ASR w/o Modifier
  • ...and 2 more figures