Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025

Zonghao Ying, Siyang Wu, Run Hao, Peng Ying, Shixuan Sun, Pengyu Chen, Junze Chen, Hao Du, Kaiwen Shen, Shangkun Wu, Jiwei Wei, Shiyuan He, Yang Yang, Xiaohai Xu, Ke Ma, Qianqian Xu, Qingming Huang, Shi Lin, Xun Wang, Changting Lin, Meng Han, Yilei Jiang, Siqi Lai, Yaozhi Zheng, Yifei Song, Xiangyu Yue, Zonglei Jing, Tianyuan Zhang, Zhilei Zhu, Aishan Liu, Jiakai Wang, Siyuan Liang, Xianglong Kong, Hainan Li, Junjie Mu, Haotong Qin, Yue Yu, Lei Chen, Felix Juefei-Xu, Qing Guo, Xinyun Chen, Yew Soon Ong, Xianglong Liu, Dawn Song, Alan Yuille, Philip Torr, Dacheng Tao

TL;DR

ATLAS 2025 targets safety vulnerabilities in multimodal large language models by organizing a two-phase adversarial testing challenge that combines white-box and black-box jailbreaks across image-text inputs. The study formalizes cross-modal jailbreak objectives, evaluates submissions via Attack Success Rate (ASR) as judged by an LLM, and exposes a spectrum of attack strategies, from flowchart-based visual prompts to reasoning-level manipulations. Key findings include near-perfect success with structured multimodal attacks, strong transferability across model families, and clear evidence of safety gaps that current defenses must address. The work offers benchmarks, methodological innovations, and practical guidance for strengthening safe multimodal AI systems, with implications for defense design, evaluation standards, and future competitions.
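To make the evaluation metric concrete, here is a minimal sketch of how an Attack Success Rate can be computed: the fraction of model responses that a judge flags as harmful. The function names and the `toy_judge` keyword heuristic are assumptions made for illustration only. The challenge uses an LLM as the judge, and its actual judging prompt and scoring rules are not reproduced in this summary.

```python
from typing import Callable, Sequence


def attack_success_rate(
    responses: Sequence[str], judge: Callable[[str], bool]
) -> float:
    """Return the fraction of responses the judge marks as a successful jailbreak."""
    if not responses:
        raise ValueError("need at least one response to compute ASR")
    successes = sum(1 for response in responses if judge(response))
    return successes / len(responses)


def toy_judge(response: str) -> bool:
    # Hypothetical stand-in for the LLM judge: any response that does not
    # open with a refusal phrase is counted as a successful attack.
    refusal_prefixes = ("i cannot", "i can't", "sorry")
    return not response.lower().startswith(refusal_prefixes)


responses = [
    "Sorry, I can't help with that.",
    "Step 1: ...",
    "I cannot assist.",
    "Sure, here is ...",
]
asr = attack_success_rate(responses, toy_judge)
print(f"ASR = {asr:.2f}")  # prints "ASR = 0.50"
```

In practice the `judge` callable would wrap a call to the judging model, so the ASR computation stays the same while the judging logic is swapped out.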

Abstract

Multimodal Large Language Models (MLLMs) have enabled transformative advancements across diverse applications but remain susceptible to safety threats, especially jailbreak attacks that induce harmful outputs. To systematically evaluate and improve their safety, we organized the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. This technical report presents findings from the competition, which involved 86 teams testing MLLM vulnerabilities via adversarial image-text attacks in two phases: white-box and black-box evaluations. The competition results highlight ongoing challenges in securing MLLMs and provide valuable guidance for developing stronger defense mechanisms. The challenge establishes new benchmarks for MLLM safety evaluation and lays groundwork for advancing safer multimodal AI systems. The code and data for this challenge are openly available at https://github.com/NY1024/ATLAS_Challenge_2025.

Paper Structure

This paper contains 53 sections, 5 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Score distribution comparison between Phase I and Phase II. The box plots and scatter points respectively reflect the statistical range and individual variations in scores. Phase II shows increased variability and a lower mean score, suggesting greater task difficulty and more diverse attack effectiveness.
  • Figure 2: An example of image design in a jailbreak attack.
  • Figure 3: Example of a linear flowchart.
  • Figure 4: Overview of Phase I.
  • Figure 5: Overview of the strategy for Phase II.
  • ...and 1 more figures