AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

Tim Beyer, Jonas Dornbusch, Jakob Steimle, Moritz Ladenburger, Leo Schwinn, Stephan Günnemann

TL;DR

AdversariaLLM tackles the fragmentation in LLM safety evaluation by delivering a unified, modular toolbox centered on reproducibility, correctness, and extensibility. It combines twelve adversarial attack algorithms, seven benchmark datasets, and JudgeZoo for standardized judgment, with open-weight LLM access via Hugging Face. Key contributions include corrected implementations, comprehensive coverage across attack types and datasets, resource-aware budgeting, per-step and distributional robustness evaluation, and robust reproducibility via complete run metadata. The framework aims to enable transparent, comparable, and scalable LLM safety research, with practical impact on cross-study reproducibility and benchmarking.

Abstract

The rapid expansion of research on Large Language Model (LLM) safety and robustness has produced a fragmented and oftentimes buggy ecosystem of implementations, datasets, and evaluation methods. This fragmentation makes reproducibility and comparability across studies challenging, hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve adversarial attack algorithms, integrates seven benchmark datasets spanning harmfulness, over-refusal, and utility evaluation, and provides access to a wide range of open-weight LLMs via Hugging Face. The implementation includes advanced features for comparability and reproducibility such as compute-resource tracking, deterministic results, and distributional evaluation techniques. AdversariaLLM also integrates judging through the companion package JudgeZoo, which can also be used independently. Together, these components aim to establish a robust foundation for transparent, comparable, and reproducible research in LLM safety.

Paper Structure

This paper contains 22 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: AdversariaLLM is a framework for reproducible and principled LLM adversarial robustness evaluation.
  • Figure 2: Our implementation tokenizes the whole input conversation at once, catching more illegal token sequences than other toolboxes. Prior implementations tokenize the prompt and the suffix separately and check only the suffix for round-trip encode-decode consistency. As a result, they cannot detect merges across segment boundaries, which leads to attacks that work in token space but are impossible to trigger with text input.
  • Figure 3: Tokenization details have a significant effect on ASR. Our implementation fixes several issues and leads to significantly improved performance. We show data for GCG against Llama-2-7B-Instruct on the non-copyright subset of the HarmBench dataset. We report cumulative best-of-$n$ ASR (i.e., at each step, the current prompt iterate is used to query the model).
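
The boundary-merge problem described in the Figure 2 caption can be illustrated with a toy example. The sketch below is not AdversariaLLM's code: the four-token vocabulary, the greedy longest-match tokenizer, and the function names `suffix_only_check` and `whole_sequence_check` are all hypothetical and chosen only to make the effect visible. It shows a token sequence that passes a suffix-only round-trip check but fails when the whole sequence is decoded and re-encoded at once.

```python
# Toy vocabulary (hypothetical): single characters plus one merged token "ab".
VOCAB = ["ab", "a", "b", "x"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids, pos = [], 0
    while pos < len(text):
        match = max(
            (tok for tok in VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids):
    """Concatenate the token strings for the given ids."""
    return "".join(VOCAB[i] for i in ids)


def round_trips(ids):
    """True if decoding and re-encoding reproduces exactly the same ids."""
    return encode(decode(ids)) == ids


def suffix_only_check(prompt_ids, suffix_ids):
    """Check only the suffix, ignoring what precedes it."""
    return round_trips(suffix_ids)


def whole_sequence_check(prompt_ids, suffix_ids):
    """Check prompt and suffix together, so boundary merges are detected."""
    return round_trips(prompt_ids + suffix_ids)


prompt_ids = encode("xa")  # tokens "x", "a"
suffix_ids = encode("b")   # token "b"

print(suffix_only_check(prompt_ids, suffix_ids))     # True
print(whole_sequence_check(prompt_ids, suffix_ids))  # False
print(encode(decode(prompt_ids + suffix_ids)))       # [3, 0], i.e. "x", "ab"
```

Here the text "xab" tokenizes to "x", "ab", so the sequence "x", "a", "b" that an attack might optimize in token space can never be produced from text input. A real tokenizer has far more merges, but the same check applies: compare the optimized ids against the ids obtained by tokenizing the full decoded input.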