LJ-Bench: Ontology-Based Benchmark for U.S. Crime

Hung Yun Tseng; Wuzhen Li; Blerina Gkotse; Grigorios Chrysos

Abstract

The potential of Large Language Models (LLMs) to provide harmful information remains a significant concern due to the vast breadth of illegal queries they may encounter. Unfortunately, existing benchmarks focus on only a handful of types of illegal activities and are not grounded in legal frameworks. In this work, we introduce an ontology of crime-related concepts grounded in the legal framework of the Model Penal Code, which serves as an influential reference for criminal law and has been adopted by many U.S. states, and instantiated using California law. This structured knowledge forms the foundation for LJ-Bench, the first comprehensive benchmark designed to evaluate LLM robustness against a wide range of illegal activities. Spanning 76 distinct crime types organized taxonomically, LJ-Bench enables systematic assessment of diverse attacks and reveals valuable insights into LLM vulnerabilities across crime categories: LLMs are more susceptible to attacks targeting societal harm than to those directly impacting individuals. Our benchmark aims to facilitate the development of more robust and trustworthy LLMs. The LJ-Bench benchmark and LJ-Ontology, along with the experiment implementations for reproducibility, are publicly available at https://github.com/AndreaTseng/LJ-Bench.

Paper Structure (46 sections, 29 figures, 9 tables)

Introduction
Related work
Categories of illegal activities
LJ Ontology and knowledge graph
LJ-Bench
Experiments
Experiment setup
Additional evaluation metrics
Broad category results
Fine-grained results
Augmented dataset
Conclusion
Limitation
Broader impact
Datasheet for dataset
...and 31 more sections

Figures (29)

Figure 1: Comparison among selected types of crime: (a) types of crime that have few questions in existing benchmarks and (b) new types of crime that do not exist in previous benchmarks. We annotated existing benchmarks manually for comparison. For the full lists of existing and new types of crime, see \ref{tab:benchmark_jailbreaking_benchmark_comparison_types_per_benchmark} and \ref{table:benchmark_jailbreaking_comparison_new_questions}.
Figure 2: Similarity of Political Campaign prompts in AdvBench (left) and LJ-Bench (right). AdvBench shows higher similarity across questions, with values reaching up to 0.98, whereas LJ-Bench shows more diversity among questions, with a maximum similarity of only 0.71. Plots for additional examples are in \ref{sec:benchmark_jailbreaking_crime_comparison_other_benchmarks_appendix}.
Figure 3: Benchmark jailbreaking results of Gemini and GPT models under 10 attacks. All models score 4.5+ in all four categories with only one exception (GPT-4o-mini scores 4.2 as its highest for "Against Animal"). Gemini models struggle most with "Against Property" scenarios across nearly all attack types, and newer GPT models are vulnerable in the "Against Animal" category under PAP attacks. Surprisingly, PAP, a non-iterative attack employing just 5 persuasive techniques, is nearly as effective as PAIR across all Gemini models. This reveals Gemini's vulnerability when harmful content is rephrased with authority appeals or evidence presentation. The exact scores and standard deviations are reported in \ref{tab:benchmark_jailbreaking_results_close_source_models}.
Figure 4: Jailbreaking results from open-source models under 6 attacks (excluding iterative attacks), using Gemini-1.5-Pro as the autograder. The eight models show different levels of vulnerability: DeepSeek-llm-67b and Mistral-7B-Instruct are highly susceptible to attacks, while Llama-3.1-8B and Gemma-2b demonstrate the strongest resistance to the prompt-based attacks evaluated here. Notably, Gemma-2b resists all of these attacks despite being the smallest model among those we tested. The exact scores and standard deviations are reported in \ref{tab:model_performance_6attacks_open_source}.
Figure 5: Jailbreaking scores for Gemini models using 4 different autograders. The consistent scoring patterns across judges confirm Gemini-1.5-Pro's reliability as the primary autograder. This strong correlation between the judges, despite their different architectures, validates our jailbreaking assessments.
...and 24 more figures
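
The prompt-diversity comparison in Figure 2 rests on pairwise similarity between the questions of one crime type. The following is a minimal sketch of that kind of measurement, not the authors' pipeline: the source does not state which embedding or similarity measure was used, so the bag-of-words vectors, the cosine measure, the function names, and the placeholder prompts are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def bow_vector(text: str) -> Counter:
    """Lowercased bag-of-words counts (assumed stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_pairwise_similarity(prompts: list[str]) -> float:
    """Largest similarity over all unordered prompt pairs (0.0 if fewer than two)."""
    vectors = [bow_vector(p) for p in prompts]
    return max((cosine(u, v) for u, v in combinations(vectors, 2)), default=0.0)


# Hypothetical, benign placeholder prompts; they are not taken from either benchmark.
near_duplicates = [
    "describe the rules for campaign donations",
    "describe the rules for campaign spending",
]
diverse = [
    "describe the rules for campaign donations",
    "which agency audits election advertising",
]

print(max_pairwise_similarity(near_duplicates))
print(max_pairwise_similarity(diverse))
```

A benchmark whose questions are near-rephrasings of one another yields a maximum close to 1, while a more varied question set yields a lower one; this is the contrast the caption reports as 0.98 versus 0.71.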
