aiXamine: Simplified LLM Safety and Security

Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, Issa Khalil

TL;DR

aiXamine is a comprehensive black-box evaluation platform for LLM safety and security that aggregates over 40 tests across eight services into a unified examination framework. The system uses a DAG-based microservices architecture to automate test execution, scoring, and detailed reporting, while enabling private model submissions and cross-version comparisons. Evaluations of over 50 models show that proprietary systems often excel overall, yet open-source models can match or exceed them in specific services such as safety alignment, fairness and bias, and OOD robustness. The results also expose trade-offs between distillation strategies, model size, training methods, and architectural choices. The platform is intended to support developers, regulators, and practitioners with granular, actionable insights, transparency, and a path toward regulatory-aligned evaluation of LLM safety and security.

Abstract

Evaluating Large Language Models (LLMs) for safety and security remains a complex task, often requiring users to navigate a fragmented landscape of ad hoc benchmarks, datasets, metrics, and reporting formats. To address this challenge, we present aiXamine, a comprehensive black-box evaluation platform for LLM safety and security. aiXamine integrates over 40 tests (i.e., benchmarks) organized into eight key services targeting specific dimensions of safety and security: adversarial robustness, code security, fairness and bias, hallucination, model and data privacy, out-of-distribution (OOD) robustness, over-refusal, and safety alignment. The platform aggregates the evaluation results into a single report per model, with a detailed breakdown of model performance, test examples, and rich visualizations. We used aiXamine to assess over 50 publicly available and proprietary LLMs, conducting over 2K examinations. Our findings reveal notable vulnerabilities in leading models, including susceptibility to adversarial attacks in OpenAI's GPT-4o, biased outputs in xAI's Grok-3, and privacy weaknesses in Google's Gemini 2.0. Additionally, we observe that open-source models can match or exceed proprietary models in specific services such as safety alignment, fairness and bias, and OOD robustness. Finally, we identify trade-offs between distillation strategies, model size, training methods, and architectural choices.

Paper Structure

This paper contains 63 sections, 1 equation, 4 figures, 14 tables.

Figures (4)

  • Figure 1: High-level design overview of aiXamine.
  • Figure 2: The aiXamine system's main page.
  • Figure 3: The aiXamine leaderboard page.
  • Figure 4: The aiXamine report page.