StealthCup: Realistic, Multi-Stage, Evasion-Focused CTF for Benchmarking IDS

Manuel Kern, Dominik Steffan, Felix Schuster, Florian Skopik, Max Landauer, David Allison, Simon Freudenthaler, Edgar Weippl

TL;DR

StealthCup tackles the challenge of evaluating IDS under realistic, stealthy, multi-stage intrusions by deploying an evasion-focused Capture-the-Flag framework on a reproducible IT/OT testbed. It combines human attacker play with Infrastructure-as-Code deployments, near-real-time detection feedback, and a transparent scoring scheme to surface IDS blind spots and evasion strategies. The inaugural run assessed 32 techniques, with 11 remaining undetected across all IDS configurations, and highlighted trade-offs between open-source and commercial solutions in terms of false positives and detection coverage. By aligning attacker behavior with state-sponsored tradecraft such as Volt Typhoon and releasing open datasets and attacker writeups, StealthCup provides a reproducible benchmark that extends beyond traditional datasets and MITRE evaluations to stress-test IDS under stealth-focused adversaries.

Abstract

Intrusion Detection Systems (IDS) are critical to defending enterprise and industrial control environments, yet evaluating their effectiveness under realistic conditions remains an open challenge. Existing benchmarks rely on synthetic datasets (e.g., NSL-KDD, CICIDS2017) or scripted replay frameworks, which fail to capture adaptive adversary behavior. Even MITRE ATT&CK Evaluations, while influential, are host-centric and assume malware-driven compromise, thereby under-representing stealthy, multi-stage intrusions across IT and OT domains. We present StealthCup, a novel evaluation methodology that operationalizes IDS benchmarking as an evasion-focused Capture-the-Flag competition. Professional penetration testers engaged in multi-stage attack chains on a realistic IT/OT testbed, with scoring penalizing IDS detections. The event generated structured attacker writeups and validated detections, along with PCAPs, host logs, and alerts. Our results reveal that out of 32 exercised attack techniques, 11 were not detected by any IDS configuration. Open-source systems (Wazuh, Suricata) produced high false-positive rates (above 90%), while commercial tools generated fewer false positives but also missed more attacks. Comparison with the Volt Typhoon APT advisory confirmed strong realism: all 28 applicable techniques were exercised, 19 appeared in writeups, and 9 in forensic traces. These findings demonstrate that StealthCup elicits attacker behavior closely aligned with state-sponsored TTPs, while exposing blind spots across both open-source and commercial IDS. The resulting datasets and methodology provide a reproducible foundation for future stealth-focused IDS evaluation.

Paper Structure

This paper contains 48 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: StealthCup multi-level IT/OT infrastructure including NIDS and SIEM in red. The bug symbol indicates installations of HIDS/EDR/XDR, while the tap indicates network-tapping.
  • Figure 2: Virtualized PLC controlling a simulated water pump: (a) interaction via ScadaLTS HMI, (b) logging and monitoring via Grafana historian.
  • Figure 3: StealthCup multi-stage attack chain as implemented, including the main objectives and the detections validated in our test run.
  • Figure 4: Timeline of the competition, showing objective solves and submitted detection scores (on a logarithmic scale, lower is better).
  • Figure 5: Comparison of MITRE Eval coverage (left) and alert/score profiles (right) for Teams 2 (Ent./OT Cup), 6 (Ent. Cup Winner), and 9 (Ent. Cup).