Semantic Superiority vs. Forensic Efficiency: A Comparative Analysis of Deep Learning and Psycholinguistics for Business Email Compromise Detection

Yaw Osei Adjei, Frederick Ayivor, Davis Opoku

TL;DR

The paper tackles BEC detection by comparing a semantic deep-learning stream (DistilBERT) with a forensic psycholinguistic stream (CatBoost) under cost-sensitive constraints. It introduces a hybrid dataset that blends legitimate Enron-like emails with AI-synthesized BEC samples, applies homoglyph normalization to the input text, and uses a financial loss function to optimize decision thresholds. DistilBERT achieves near-perfect AUC and high F1 on GPU-accelerated hardware, while CatBoost delivers competitive performance with substantially lower latency and resource use, enabling edge deployment. The study highlights the ROI and infrastructure trade-offs, provides a practical framework for choosing between high-accuracy, GPU-intensive models and cost-efficient, CPU-friendly alternatives, and identifies directions for longitudinal validation and psycholinguistic feature verification.
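As a rough illustration of what homoglyph normalization can look like, here is a minimal sketch. The summary does not say how the authors implement this step, so everything below is an assumption: the `CONFUSABLES` map is a hypothetical, deliberately tiny table, and `normalize_homoglyphs` is an illustrative name, not the paper's function.

```python
import unicodedata

# Hypothetical, deliberately tiny map of look-alike characters to Latin.
# A real deployment would need a far larger table; this is an assumption.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
    "\u03bf": "o",  # Greek small omicron
}
_TABLE = str.maketrans(CONFUSABLES)


def normalize_homoglyphs(text):
    """Map visually confusable characters to plain Latin letters."""
    # NFKC folds compatibility forms (for example fullwidth letters).
    folded = unicodedata.normalize("NFKC", text)
    # The lookup table handles cross-script look-alikes NFKC leaves alone.
    return folded.translate(_TABLE)


# "payment" spelled with a Cyrillic a and a Cyrillic ie.
spoofed = "p\u0430ym\u0435nt"
cleaned = normalize_homoglyphs(spoofed)
```

Applying a step like this before tokenization means a spoofed word and its plain-Latin spelling reach the model as the same string.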

Abstract

Business Email Compromise (BEC) is a sophisticated social engineering threat that manipulates organizational hierarchies, leading to significant financial damage. According to the 2024 FBI Internet Crime Report, BEC accounts for over $2.9 billion in annual losses, presenting a massive economic asymmetry: the financial cost of a False Negative (fraud loss) exceeds the operational cost of a False Positive (manual review) by a ratio of approximately 5,480:1. This paper contrasts two detection paradigms: a Forensic Psycholinguistic Stream (CatBoost), which analyzes linguistic cues like urgency and authority with high interpretability, and a Semantic Stream (DistilBERT), which utilizes deep learning for contextual understanding. We evaluated both streams on a hybrid dataset (N = 7,990) containing human-legitimate and AI-synthesized adversarial fraud. Benchmarked on Tesla T4 infrastructure, DistilBERT achieved near-perfect detection on synthetic threats (AUC > 0.99, F1 = 0.998) with acceptable real-time latency (7.4 ms). CatBoost achieved competitive detection (AUC = 0.991, F1 = 0.949) at 8.4x lower latency (0.8 ms) with negligible resource consumption. We conclude that while DistilBERT offers maximum accuracy for GPU-equipped organizations, CatBoost provides a viable, cost-effective alternative for edge deployments. Both approaches demonstrate a theoretical ROI exceeding 99.9% when optimized via cost-sensitive learning.
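The cost asymmetry in the abstract can be turned into a decision threshold with a simple search. The sketch below is one plausible reading of "cost-sensitive" threshold optimization, not the authors' procedure: only the 5,480:1 ratio comes from the abstract, while the unit costs, the function names, the grid search, and the toy labels and scores are assumptions made up for illustration.

```python
# Cost ratio from the abstract: a missed fraud (false negative) costs about
# 5,480 times one manual review (false positive). Unit costs are assumed.
COST_FN = 5480.0
COST_FP = 1.0


def total_cost(y_true, scores, threshold):
    """Cost of flagging every email whose score is >= threshold."""
    false_neg = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    false_pos = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return COST_FN * false_neg + COST_FP * false_pos


def best_threshold(y_true, scores, steps=100):
    """Grid-search thresholds in [0, 1]; ties go to the lowest threshold."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda t: total_cost(y_true, scores, t))


# Toy data (hypothetical, not from the paper): 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.125, 0.40, 0.60, 0.335, 0.90]

chosen = best_threshold(y_true, scores)
cost_at_chosen = total_cost(y_true, scores, chosen)
cost_at_default = total_cost(y_true, scores, 0.5)
```

On this toy data the search settles well below the conventional 0.5 cutoff: with a ratio this lopsided, accepting a couple of extra manual reviews is far cheaper than missing a single fraudulent email.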

Paper Structure

This paper contains 39 sections, 2 equations, 12 figures, 7 tables, 2 algorithms.

Figures (12)

  • Figure 1: Comparative Architecture. Stream A (Forensic/Green) extracts psycholinguistic features for a Gradient Boosting classifier. Stream B (Semantic/Blue) uses a hardened Tokenizer and Transformer to extract deep embeddings. GPU acceleration reduces DistilBERT latency to a range acceptable for real-time use.
  • Figure 2: Latency Distribution (1,586 samples). CatBoost demonstrates consistent sub-millisecond performance (mean = 0.885 ms, green triangle). DistilBERT achieves acceptable real-time latency (mean = 7.403 ms, green triangle) with GPU acceleration, though with higher variance due to variable text lengths.
  • Figure 3: Learning Curve (CatBoost). Validation AUC plateaus at approximately 3,000 training samples. The gap between training (red) and validation (blue) curves remains minimal, indicating good generalization. The shaded region represents 95% confidence intervals across 5-fold cross-validation.
  • Figure 4: Adversarial Robustness Analysis and Error Distribution. Top: Recall on Clean vs. Poisoned data (Left) and Degradation magnitude (Right). Bottom: McNemar's Contingency Table (Left) and Pie Chart of Agreement (Right). The statistical analysis indicates a significant disagreement between the two models (1.6%) that must be managed.
  • Figure 5: Model Prediction Correlation (r = 0.9756). Both models show strong agreement on most samples. The majority of samples cluster near (0, 0) for legitimate emails and (1, 1) for fraudulent emails. DistilBERT shows more decisive predictions (binary clustering), while CatBoost exhibits more graduated probability distributions.
  • ...and 7 more figures