DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection

Joymallya Chakraborty, Wei Xia, Anirban Majumder, Dan Ma, Walid Chaabene, Naveed Janvekar

TL;DR

DetoxBench provides a dedicated, multitask benchmark for large language models on fraud and abuse detection, compiling eight public datasets that span hate speech, toxic chat, phishing, spam, fake news, fraudulent postings, and misogyny to evaluate diverse LLMs. The study compares multiple Bedrock-enabled providers (AI21 Jurassic, Cohere Command, Anthropic Claude, Mistral Mixtral/Large) under zero- and few-shot prompting, revealing that model size and architecture drive performance, with Mistral Large and Claude variants often leading in F1, while some models show high precision but low recall on abusive content. It also highlights practical considerations like context windows, inference speed, and JSON-format outputs, and discusses limitations including non-real fraud data, English-only scope, and API constraints. The work offers actionable insights for building robust, trustworthy fraud/abuse detection systems and outlines future directions such as fine-tuning and more advanced prompting strategies to enhance multi-task performance.
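The "high precision but low recall" pattern mentioned above can be made concrete with a small worked example. The sketch below is illustrative only: the labels are hypothetical (1 = abusive, 0 = benign), the function name `precision_recall_f1` is our own, and it is not taken from the DetoxBench code. It shows how a model that flags very little content can reach perfect precision while still earning a low F1 score.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # abusive, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # abusive, missed

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical data: four abusive and four benign texts.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# A conservative model flags only one text, and that flag is correct.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.25 f1=0.40
```

Because F1 is the harmonic mean of precision and recall, the missed abusive texts pull the score down sharply even though every flag the model raised was correct.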

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks. However, their practical application in high-stakes domains, such as fraud and abuse detection, remains an area that requires further exploration. Existing applications often focus narrowly on specific tasks like toxicity or hate speech detection. In this paper, we present a comprehensive benchmark suite designed to assess the performance of LLMs in identifying and mitigating fraudulent and abusive language across various real-world scenarios. Our benchmark encompasses a diverse set of tasks, including detecting spam emails, hate speech, misogynistic language, and more. We evaluated several state-of-the-art LLMs, including models from Anthropic, Mistral AI, and the AI21 family, to provide a comprehensive assessment of their capabilities in this critical domain. The results indicate that while LLMs exhibit proficient baseline performance in individual fraud and abuse detection tasks, their performance varies considerably across tasks, and they struggle in particular with tasks that demand nuanced pragmatic reasoning, such as identifying diverse forms of misogynistic language. These findings have important implications for the responsible development and deployment of LLMs in high-risk applications. Our benchmark suite can serve as a tool for researchers and practitioners to systematically evaluate LLMs for multi-task fraud detection and drive the creation of more robust, trustworthy, and ethically aligned systems for fraud and abuse detection.

Paper Structure (33 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: The DetoxBench Pipeline
  • Figure 2: Classification performance of 8 LLMs on different tasks with the zero-shot prompting strategy. The results are shown as three metrics: Precision (Blue), Recall (Yellow), and F1 Score (Green). We refer to the LLMs by these abbreviations: AIJ2M - AI21.J2-Mid; AIJU - AI21.J2-Ultra; CCT - Cohere.Command-Text; CCLT - Cohere.Command-Light-Text; ACV2 - Anthropic.Claude-V2; ACV2:1 - Anthropic.Claude-V2:1; MM - Mixtral MOE; ML - Mistral Large.
  • Figure 3: Classification performance of 8 LLMs on different tasks with the few-shot prompting strategy. The results are shown as three metrics: Precision (Blue), Recall (Yellow), and F1 Score (Green).
  • Figure 4: F1 score of 8 LLMs on different tasks with the zero-shot and few-shot prompting strategies, respectively.
  • Figure 5: The Transformer model architecture (Vaswani et al., 2023)