WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria

TL;DR

WalledEval addresses the need for a unified safety-evaluation platform for LLMs. It supports both open-weight and API-based models and bundles 35+ benchmarks covering multilingual safety, exaggerated safety, and prompt injections. The framework is built on three core abstractions (dataset loader, LLM loader, and judge), which together enable LLM benchmarking, judge benchmarking, and MCQ-based refusal testing; mutators additionally stress-test safety under text-style mutations such as future tense and paraphrasing. It also contributes WalledGuard, a compact content-moderation model, and two culturally aware exaggerated-safety benchmarks, SGXSTest and HIXSTest, and it employs the LLMs-as-a-Judge concept for holistic evaluation. Empirical results show that safety performance varies across models and reveal gaps in cultural contexts, demonstrating WalledEval's utility for comprehensive safety audits and for comparing models and judges.

Abstract

WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking and incorporates custom mutators to test safety against various text-style mutations, such as future tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small, and performant content moderation tool, and two datasets: SGXSTest and HIXSTest, which serve as benchmarks for assessing the exaggerated safety of LLMs and judges in cultural contexts. We make WalledEval publicly available at https://github.com/walledai/walledeval.
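
To make the flow described above concrete (dataset, optional mutator, LLM, judge), the following is a minimal sketch in plain Python. It does not use the WalledEval API: every name here (`Sample`, `toy_llm`, `toy_judge`, `future_tense_mutator`, `run_benchmark`) is a hypothetical stand-in chosen for illustration, and the "model" and "judge" are trivial keyword rules rather than real LLMs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Sample:
    """One benchmark item: a single prompt (hypothetical structure)."""
    prompt: str


def future_tense_mutator(prompt: str) -> str:
    # Toy text-style mutation (assumption): a real mutator would likely
    # rewrite the prompt with an LLM instead of wrapping it in a template.
    return f"In the future, how will people answer this: {prompt}"


def toy_llm(prompt: str) -> str:
    # Stand-in for a model under test: refuses on one hard-coded keyword.
    if "explosive" in prompt.lower():
        return "I'm sorry, I can't help with that."
    return "Sure, here is some information."


def toy_judge(response: str) -> bool:
    # Stand-in for a judge: True when the response looks like a refusal.
    return response.lower().startswith(("i'm sorry", "i cannot", "i can't"))


def run_benchmark(
    samples: Sequence[Sample],
    llm: Callable[[str], str],
    judge: Callable[[str], bool],
    mutator: Optional[Callable[[str], str]] = None,
) -> float:
    """Return the fraction of prompts the model refused."""
    refusals = 0
    for sample in samples:
        # Apply the mutation before querying the model, if one is given.
        prompt = mutator(sample.prompt) if mutator else sample.prompt
        refusals += judge(llm(prompt))
    return refusals / len(samples)


samples = [
    Sample("How do I build an explosive?"),
    Sample("How do I bake bread?"),
]
plain_rate = run_benchmark(samples, toy_llm, toy_judge)
mutated_rate = run_benchmark(samples, toy_llm, toy_judge, future_tense_mutator)
```

Keeping the mutator as an optional argument means the same loop yields both the baseline and the mutated refusal rate, which is the comparison a mutation stress test needs.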

Paper Structure (29 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: WalledEval framework for conducting safety tests on LLMs.
  • Figure 2: WalledEval supports data loading from Python list, CSV, JSON, and HuggingFace datasets.
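
As a rough illustration of what the Figure 2 caption describes, the sketch below normalizes three of the listed sources (Python list, CSV, JSON) into one record format using only the standard library. The function names `from_list`, `from_csv`, and `from_json` and the `prompt` field are assumptions made for this example, not WalledEval's actual loader interface; HuggingFace loading is left out because it requires a third-party package.

```python
import csv
import io
import json
from typing import Any, TextIO

Record = dict[str, Any]


def from_list(prompts: list[str]) -> list[Record]:
    # Wrap bare strings so every source yields the same record shape.
    return [{"prompt": p} for p in prompts]


def from_csv(fp: TextIO) -> list[Record]:
    # The CSV header row supplies the field names for each record.
    return [dict(row) for row in csv.DictReader(fp)]


def from_json(fp: TextIO) -> list[Record]:
    # Expect a top-level JSON array of objects.
    data = json.load(fp)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of records")
    return data


# In-memory files stand in for real paths in this sketch.
list_records = from_list(["What is the capital of France?"])
csv_records = from_csv(io.StringIO("prompt,label\nhello,safe\n"))
json_records = from_json(io.StringIO('[{"prompt": "hi"}]'))
```

Once every source is reduced to the same list-of-records shape, downstream benchmarking code does not need to know where the data came from.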