OrgForge-IT: A Verifiable Synthetic Benchmark for LLM-Based Insider Threat Detection

Jeffrey Flynt

Abstract

Synthetic insider threat benchmarks face a consistency problem: corpora generated without an external factual constraint cannot rule out cross-artifact contradictions. The CERT dataset, the field's canonical benchmark, is also static, lacks cross-surface correlation scenarios, and predates the LLM era. We present OrgForge-IT, a verifiable synthetic benchmark in which a deterministic simulation engine maintains ground truth and language models generate only surface prose, making cross-artifact consistency an architectural guarantee. The corpus spans 51 simulated days, 2,904 telemetry records at a 96.4% noise rate, and four detection scenarios designed to defeat single-surface and single-day triage strategies across three threat classes and eight injectable behaviors. A ten-model leaderboard reveals five findings: (1) triage and verdict accuracy dissociate: eight models achieve identical triage F1=0.80 yet split between verdict F1=1.0 and 0.80; (2) baseline false-positive rate is a necessary companion to verdict F1, with models at identical verdict accuracy differing by two orders of magnitude on triage noise; (3) victim attribution in the vishing scenario separates tiers: Tier A models exonerate the compromised account holder, while Tier B models detect the attack but misclassify the victim; (4) rigid multi-signal thresholds structurally exclude single-surface negligent insiders, demonstrating the necessity of parallel, threat-class-specific triage pipelines; and (5) agentic software-engineering training acts as a force multiplier for multi-day temporal correlation, but only when paired with frontier-level parameter scale. Finally, prompt sensitivity analysis reveals that unstructured prompts induce vocabulary hallucination, motivating a two-track scoring framework that separates prompt adherence from reasoning capability. OrgForge-IT is open source under the MIT license.

Paper Structure (54 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Architecture of the physics-cognition boundary. The deterministic Python engine owns all factual simulation state and writes ground truth; language models operate exclusively at designated injection points, generating surface prose from validated proposals. LLMs cannot write to the event log or mutate state directly, making cross-artifact consistency an architectural guarantee rather than an empirical claim.
  • Figure 2: The sliding window problem illustrated by the three-phase host data hoarding scenario. Each phase falls below the two-signal escalation threshold when examined in isolation. A hoarding_trail_start_day breadcrumb links phase 3 back to phase 1, but a triage agent processing only the current window's events treats phase 3 as an isolated archive move rather than the culmination of a multi-day exfiltration sequence.
  • Figure 3: Three-stage evaluation pipeline. Stage 1 establishes a clean-data false-positive rate using pre-onset baseline records. Stage 2 applies a 7-day sliding window with a two-signal escalation threshold. Stage 3 receives only escalated suspects and produces a structured JSON verdict with per-artifact evidence citations. Strong Stage 2 performance does not predict Stage 3 performance.
  • Figure 4: Triage F$_1$ vs. verdict F$_1$ for all ten models. Bubble area is proportional to baseline false-positive rate (larger = more triage noise). Eight models cluster at triage F$_1 = 0.80$ yet split between verdict F$_1 = 1.0$ (Tier A) and F$_1 = 0.80$ (Tier B), illustrating the triage/verdict dissociation. Within Tier B, Llama 3.3 70B's bubble (FP = 0.813, red outline) dominates the chart relative to the tight clean-baseline cluster (FP $\leq$ 0.023), making the operational cost of equal verdict accuracy immediately visible. Llama 3.3 70B is operationally disqualified despite matching Tier B verdict F$_1$.
  • Figure 5: Day 19 timeline showing two simultaneous sessions on Chris's account. The anomalous session (macOS, TOTP, residential IP) is initiated 17 minutes after Jax's vishing call and is inconsistent with Chris's established device profile (iOS, push notification, corporate IP). Tier A models reason that mutually inconsistent concurrent sessions identify attacker access rather than account-holder behavior, returning a verdict of innocent for Chris. Tier B models flag the anomalous session as evidence against Chris, producing the false-positive verdict that holds all eight at precision 0.667.
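
The captions for Figures 2, 3, and 5 describe two mechanisms that a short sketch may make concrete: a 7-day sliding window with a two-signal escalation threshold, and the relationship between the reported precision of 0.667 and a verdict F1 of 0.80. The Python below is a minimal illustration under stated assumptions, not the benchmark's implementation. The event tuples, the actor names, the signal names, the phase days, and the choice to count distinct signal types per actor are all hypothetical. The F1 check reads 0.667 as 2/3 and assumes recall of 1.0, which is the only recall value at which precision 2/3 yields F1 = 0.80.

```python
from collections import defaultdict

WINDOW_DAYS = 7   # window length, taken from the Figure 3 caption
MIN_SIGNALS = 2   # two-signal escalation threshold, taken from the Figure 3 caption


def escalated_suspects(events, current_day,
                       window_days=WINDOW_DAYS, min_signals=MIN_SIGNALS):
    """Return actors with at least `min_signals` distinct signal types
    inside the window ending on `current_day`.

    Counting distinct signal types is an assumption made for this sketch.
    """
    start = current_day - window_days + 1
    seen = defaultdict(set)
    for day, actor, signal in events:
        if start <= day <= current_day:
            seen[actor].add(signal)
    return sorted(a for a, sigs in seen.items() if len(sigs) >= min_signals)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical telemetry as (day, actor, signal) tuples.
# actor_x mimics a three-phase hoarding pattern with one signal per phase,
# spaced further apart than the window; actor_y shows two signals close together.
events = [
    (3, "actor_x", "bulk_file_access"),
    (14, "actor_x", "local_staging"),
    (25, "actor_x", "archive_move"),
    (24, "actor_y", "off_hours_login"),
    (25, "actor_y", "external_upload"),
]

if __name__ == "__main__":
    # With the 7-day window, only actor_y crosses the threshold on day 25.
    print(escalated_suspects(events, current_day=25))
    # Widening the window (a hypothetical 30 days) links all three phases.
    print(escalated_suspects(events, current_day=25, window_days=30))
    # Precision 2/3 with assumed recall 1.0.
    print(round(f1(2 / 3, 1.0), 2))
```

In this toy data each phase of the multi-day pattern contributes a single signal to any given 7-day window, so the actor is never escalated unless earlier phases are linked back in, which is the role the captions assign to the hoarding_trail_start_day breadcrumb.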