ORCHID: Streaming Threat Detection over Versioned Provenance Graphs
Akul Goyal, Jason Liu, Adam Bates, Gang Wang
TL;DR
ORCHID performs real-time threat detection over versioned provenance graphs by streaming embeddings of per-entity histories through a GRU-based sequential model, which avoids storing the full graph. By incorporating root-cause context into each embedding, ORCHID captures long-range dependencies while keeping its memory footprint near O(|V|) and its per-edge latency near zero. Across four public datasets, ORCHID achieves competitive or superior anomaly detection compared with offline and streaming GNN baselines, while sharply reducing memory usage (roughly 2.7 GB vs. 143.7 GB) and detection lag (0.002 s per event vs. hours). These results indicate that lossless provenance-based intrusion detection can scale to real-time EDR workloads, enabling timely discovery of attacker activity with low resource demands and resilience to distribution shifts.
Abstract
While Endpoint Detection and Response (EDR) systems can efficiently monitor threats by comparing static rules to the event stream, their inability to incorporate past system context leads to high rates of false alarms. Recent work has demonstrated Provenance-based Intrusion Detection Systems (Prov-IDS) that can examine the causal relationships between abnormal behaviors to improve threat classification. However, employing these Prov-IDS in practical settings remains difficult: state-of-the-art neural-network-based systems are fast only in a fully offline deployment model that increases attacker dwell time, and they rely on simplified, less accurate provenance graphs to reduce memory consumption. Thus, today's Prov-IDS cannot operate effectively in the real-time streaming setting required for commercial EDR viability. This work presents the design and implementation of ORCHID, a novel Prov-IDS that performs fine-grained detection of process-level threats over a real-time event stream. ORCHID takes advantage of the unique immutable properties of versioned provenance graphs to iteratively embed the entire graph in a sequential RNN model while consuming only a fraction of the computation and memory costs. We evaluate ORCHID on four public datasets, including DARPA TC, to show that ORCHID can provide competitive classification performance while eliminating detection lag and reducing memory consumption by two orders of magnitude.
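To make the streaming idea concrete, the following is a minimal sketch of a per-entity recurrent update, not ORCHID's actual implementation. It assumes a standard GRU cell with untrained random weights, a three-item event vocabulary, and one hidden vector per entity; the names `StreamingGRU`, `observe`, `HIDDEN`, and `EVENT_TYPES` are hypothetical. Node versioning and root-cause tracking are simplified to feeding the source entity's current state into the destination's update. The point it illustrates is that no edges are retained, so memory grows with the number of entities rather than the number of events.

```python
import math
import random

HIDDEN = 4                                # hypothetical embedding size
EVENT_TYPES = ["read", "write", "exec"]   # hypothetical event vocabulary


def _matrix(rows, cols, rng):
    """Small random weight matrix (untrained; for illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def _matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class StreamingGRU:
    """Keeps one hidden vector per entity and updates it once per event."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        in_dim = len(EVENT_TYPES) + HIDDEN   # one-hot event type + source state
        # Input (W) and recurrent (U) weights for the update (z), reset (r),
        # and candidate (n) gates of a standard GRU cell.
        self.W = {g: _matrix(HIDDEN, in_dim, rng) for g in "zrn"}
        self.U = {g: _matrix(HIDDEN, HIDDEN, rng) for g in "zrn"}
        self.state = {}   # entity id -> hidden vector; no edges are stored

    def _get(self, entity):
        """Return the entity's state, creating a zero vector on first sight."""
        return self.state.setdefault(entity, [0.0] * HIDDEN)

    def observe(self, src, dst, event_type):
        """Consume one edge (src -> dst) and return dst's new embedding."""
        one_hot = [1.0 if t == event_type else 0.0 for t in EVENT_TYPES]
        x = one_hot + self._get(src)   # source state carries upstream history
        h = self._get(dst)
        z = [_sigmoid(a + b) for a, b in
             zip(_matvec(self.W["z"], x), _matvec(self.U["z"], h))]
        r = [_sigmoid(a + b) for a, b in
             zip(_matvec(self.W["r"], x), _matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in
             zip(_matvec(self.W["n"], x), _matvec(self.U["n"], rh))]
        h_new = [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]
        self.state[dst] = h_new
        return h_new


# Usage: feed a short, made-up event stream one edge at a time.
model = StreamingGRU()
events = [("bash", "curl", "exec"),
          ("curl", "/tmp/x", "write"),
          ("bash", "/tmp/x", "read")]
for src, dst, etype in events:
    emb = model.observe(src, dst, etype)
```

After the three events, `model.state` holds exactly one vector for each of the three entities seen, regardless of how many edges touched them. A real detector would train the weights and score each updated embedding for anomalies; both steps are omitted here.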
