AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement

Zhexin Zhang; Leqi Lei; Junxiao Yang; Xijie Huang; Yida Lu; Shiyao Cui; Renmiao Chen; Qinglin Zhang; Xinyuan Wang; Hao Wang; Hao Li; Xianqi Lei; Chengwei Pan; Lei Sha; Hongning Wang; Minlie Huang

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement

Zhexin Zhang, Leqi Lei, Junxiao Yang, Xijie Huang, Yida Lu, Shiyao Cui, Renmiao Chen, Qinglin Zhang, Xinyuan Wang, Hao Wang, Hao Li, Xianqi Lei, Chengwei Pan, Lei Sha, Hongning Wang, Minlie Huang

TL;DR

This paper presents AISafetyLab, a unified framework and toolkit for evaluating and improving AI safety by integrating attack, defense, and evaluation components with four auxiliary modules. It provides broad method coverage, a structured design, and support for both local and API-based models, demonstrated through Vicuna-focused experiments that reveal strengths of certain defenses and persistent gaps in evaluation robustness. The work includes a detailed architectural description (Attack, Defense, Evaluation, and auxiliaries), a public implementation, and practical usage guidance with code examples. Key findings highlight the effectiveness of inference-time defenses like Prompt Guard and robust evaluation challenges, underscoring the need for consistent benchmarks and extensible tooling to advance AI safety research. The publicly available AISafetyLab aims to accelerate systematic safety research and collaboration by offering an extensible, end-to-end platform for attackers, defenders, and evaluators alike.

Abstract

As AI models are increasingly deployed across diverse real-world scenarios, ensuring their safety remains a critical yet underexplored challenge. While substantial efforts have been made to evaluate and enhance AI safety, the lack of a standardized framework and comprehensive toolkit poses significant obstacles to systematic research and practical adoption. To bridge this gap, we introduce AISafetyLab, a unified framework and toolkit that integrates representative attack, defense, and evaluation methodologies for AI safety. AISafetyLab features an intuitive interface that enables developers to seamlessly apply various techniques while maintaining a well-structured and extensible codebase for future advancements. Additionally, we conduct empirical studies on Vicuna, analyzing different attack and defense strategies to provide valuable insights into their comparative effectiveness. To facilitate ongoing research and development in AI safety, AISafetyLab is publicly available at https://github.com/thu-coai/AISafetyLab, and we are committed to its continuous maintenance and improvement.

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement

TL;DR

Abstract

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (2)