TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

Xiuyuan Chen, Jian Zhao, Yuxiang He, Yuan Xun, Xinwei Liu, Yanshu Li, Huilin Zhou, Wei Cai, Ziyan Shi, Yuchen Yuan, Tianle Zhang, Chi Zhang, Xuelong Li

TL;DR

TeleAI-Safety addresses the fragmented landscape of LLM jailbreak safety by delivering a unified, modular framework and a standardized benchmark. It combines 19 attack methods, 29 defense methods, and 19 evaluation methods, including the self-developed Morpheus and RADAR components, with a 342-sample dataset spanning 12 risk categories, and evaluates 14 target models. The results reveal systematic vulnerabilities, safety-utility trade-offs, and the generalization and reliability challenges of current defenses and evaluators. By enabling configurable, reproducible assessments, TeleAI-Safety offers a scalable foundation for robust, enterprise-grade LLM safety research and deployment.

Abstract

While the deployment of large language models (LLMs) in high-value industries continues to expand, the systematic assessment of their safety against jailbreak and prompt-based attacks remains insufficient. Existing safety evaluation benchmarks and frameworks are often limited by an imbalanced integration of core components (attack, defense, and evaluation methods) and an isolation between flexible evaluation frameworks and standardized benchmarking capabilities. These limitations hinder reliable cross-study comparisons and create unnecessary overhead for comprehensive risk assessment. To address these gaps, we present TeleAI-Safety, a modular and reproducible framework coupled with a systematic benchmark for rigorous LLM safety evaluation. Our framework integrates a broad collection of 19 attack methods (including one self-developed method), 29 defense methods, and 19 evaluation methods (including one self-developed method). With a curated attack corpus of 342 samples spanning 12 distinct risk categories, the TeleAI-Safety benchmark conducts extensive evaluations across 14 target models. The results reveal systematic vulnerabilities and model-specific failure cases, highlighting critical trade-offs between safety and utility, and identifying potential defense patterns for future optimization. In practical scenarios, TeleAI-Safety can be flexibly adjusted with customized attack, defense, and evaluation combinations to meet specific demands. We release our complete code and evaluation results to facilitate reproducible research and establish unified safety baselines.

Paper Structure

This paper contains 43 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The distribution of harmful data samples across 12 risk categories in the TeleAI-Safety dataset after the completion of the curation process.
  • Figure 2: The framework of TeleAI-Safety, which integrates 19 attack methods, 29 defense methods, and 19 evaluation methods. The blue highlights indicate our latest self-developed methods.
  • Figure 3: Safety performance of black-box models across different risk categories (using 1-ASR as the metric).
  • Figure 4: Safety performance of white-box models across different risk categories (using 1-ASR as the metric).
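
Figures 3 and 4 report safety as 1-ASR, where ASR is the attack success rate. As a minimal sketch of how such a per-category score could be computed, assuming ASR is the fraction of attack attempts that an evaluator judges successful, the Python snippet below aggregates hypothetical records. The function name, the record layout, and the category labels are illustrative assumptions and are not taken from the TeleAI-Safety code.

```python
from collections import defaultdict


def safety_by_category(records):
    """Return 1 - ASR for each risk category.

    `records` is an iterable of (category, attack_succeeded) pairs.
    This layout is a hypothetical one chosen for illustration.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, attack_succeeded in records:
        totals[category] += 1
        if attack_succeeded:
            successes[category] += 1
    # ASR = successful attacks / total attempts; safety score = 1 - ASR.
    return {c: 1.0 - successes[c] / totals[c] for c in totals}


# Hypothetical evaluator verdicts; the category names are placeholders.
records = [
    ("privacy", True),
    ("privacy", False),
    ("privacy", False),
    ("privacy", False),
    ("fraud", True),
    ("fraud", True),
]
scores = safety_by_category(records)
print(scores)
```

Under this reading, a higher score means fewer successful jailbreaks in that category, so a model that resists every attack in a category scores 1.0 and one that is always jailbroken scores 0.0.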