ORCHID: Streaming Threat Detection over Versioned Provenance Graphs

Akul Goyal; Jason Liu; Adam Bates; Gang Wang

TL;DR

ORCHID tackles real-time threat detection on versioned provenance graphs by streaming embeddings of per-entity histories through a GRU-based sequential model, avoiding the need to store the full graph. By incorporating root-cause context into each embedding, ORCHID captures long-range dependencies while keeping its memory footprint near O(|V|) and its per-edge latency near zero. Across four public datasets, ORCHID achieves competitive or superior anomaly detection compared with offline and streaming GNN baselines, while reducing memory usage (roughly 2.7 GB vs 143.7 GB) and detection lag (0.002 s per event vs hours). This work demonstrates that lossless provenance-based intrusion detection can scale to real-time EDR workloads, enabling timely discovery of attacker activity with low resource demands and resilience to distribution shifts.

Abstract

While Endpoint Detection and Response (EDR) systems are able to efficiently monitor threats by comparing static rules to the event stream, their inability to incorporate past system context leads to high rates of false alarms. Recent work has demonstrated Provenance-based Intrusion Detection Systems (Prov-IDS) that can examine the causal relationships between abnormal behaviors to improve threat classification. However, employing these Prov-IDS in practical settings remains difficult: state-of-the-art neural-network-based systems are only fast in a fully offline deployment model that increases attacker dwell time, and they rely on simplified, less accurate provenance graphs to reduce memory consumption. Thus, today's Prov-IDS cannot operate effectively in the real-time streaming setting required for commercial EDR viability. This work presents the design and implementation of ORCHID, a novel Prov-IDS that performs fine-grained detection of process-level threats over a real-time event stream. ORCHID takes advantage of the unique immutable properties of versioned provenance graphs to iteratively embed the entire graph in a sequential RNN model while consuming only a fraction of the computation and memory costs. We evaluate ORCHID on four public datasets, including DARPA TC, to show that ORCHID can provide competitive classification performance while eliminating detection lag and reducing memory consumption by two orders of magnitude.
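
To make the streaming idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a dictionary that stores only the embedding of each entity's current version, updated per event by a standard GRU cell with random, untrained weights. The names `TinyGRUCell`, `StreamingEmbedder`, `HIDDEN`, and `EVENT_TYPES` are hypothetical, and anomaly scoring is omitted.

```python
import math
import random

HIDDEN = 8  # assumed embedding width (hypothetical)
EVENT_TYPES = ["read", "write", "exec", "fork"]  # assumed event vocabulary


def _matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyGRUCell:
    """Standard GRU cell with random, untrained weights (illustration only)."""

    def __init__(self, in_dim, hid_dim, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.Wz, self.Uz = mat(hid_dim, in_dim), mat(hid_dim, hid_dim)
        self.Wr, self.Ur = mat(hid_dim, in_dim), mat(hid_dim, hid_dim)
        self.Wh, self.Uh = mat(hid_dim, in_dim), mat(hid_dim, hid_dim)

    def step(self, x, h):
        # Update gate z and reset gate r
        z = [_sigmoid(a + b)
             for a, b in zip(_matvec(self.Wz, x), _matvec(self.Uz, h))]
        r = [_sigmoid(a + b)
             for a, b in zip(_matvec(self.Wr, x), _matvec(self.Ur, h))]
        # Candidate state computed from the reset-gated previous state
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(_matvec(self.Wh, x), _matvec(self.Uh, rh))]
        # Blend previous state and candidate
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


class StreamingEmbedder:
    """Keeps one embedding per entity: the one for its current version."""

    def __init__(self):
        self.cell = TinyGRUCell(HIDDEN + len(EVENT_TYPES), HIDDEN)
        self.state = {}  # entity id -> embedding of its current version

    def observe(self, src, dst, event):
        h_src = self.state.get(src, [0.0] * HIDDEN)
        h_dst = self.state.get(dst, [0.0] * HIDDEN)
        one_hot = [1.0 if event == e else 0.0 for e in EVENT_TYPES]
        # The event creates a new version of dst; overwrite its old
        # embedding rather than storing another node.
        self.state[dst] = self.cell.step(h_src + one_hot, h_dst)
        self.state.setdefault(src, h_src)
        return self.state[dst]


emb = StreamingEmbedder()
events = [("bash", "file1", "write"),
          ("file1", "proc2", "read"),
          ("bash", "file1", "write")]
for src, dst, ev in events:
    emb.observe(src, dst, ev)
print(len(emb.state))  # 3
```

In this sketch the dictionary grows with the number of distinct entities rather than with the number of events or versions, which is the intuition behind the near-O(|V|) footprint described in the TL;DR.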

Paper Structure (50 sections, 2 equations, 7 figures, 2 tables)

Introduction
Motivation
Key Limitations of Prior Work
Memory Overhead
Detection Lag
Our Approach
Threat Model
Orchid Design
Preliminaries and Background
Provenance Graph
Sequence-Based ML and RNN
Key Idea
Vectorizing System Entities
Embedding and Detection
Accounting for Long Term Dependencies
...and 35 more sections

Figures (7)

Figure 1: Overview of the Orchid architecture, as demonstrated on an example provenance graph. Green and red nodes represent benign and attack entities, respectively. To create a more precise and acyclic provenance graph, entities are versioned in the graph as they are updated, e.g., h1' is a new version of h1. In a GNN, the entire graph including testing data must be collected before training begins. In contrast, Orchid continuously embeds and classifies new events as they occur following a preliminary training period. Times 1-7 demonstrate how Orchid's streaming dictionary has evolved at discrete points in the timeseries. By only maintaining state on the current version of system entities, Orchid is able to maintain a smaller memory footprint, as can be seen by comparing the dimensions of its dictionary to the GNN's matrices.
Figure 2: Hyperparameter tuning of ORCHID. Area Under Curve (AUC) for each line is reported in the legend.
Figure 3: Performance of ORCHID, reported as ROC curves, as compared to GNN-based approaches. In a streaming setting, Orchid's performance is generally comparable to an offline GNN deployment (Full-GNN), particularly in the low-FPR regions of the plot that denote plausible detection thresholds. When the GNN is adapted to an online setting with training data equivalent to Orchid's (Stream-GNN), it is thoroughly outperformed.
Figure 4: Memory consumption of different IDS models on the Trace dataset, as compared to the raw audit log. Versioned-GNN denotes the memory footprint of Full-GNN if it operated on the more precise versioned provenance graph used by Orchid. We were only able to successfully train Full-GNN on $2$ days of the versioned graph; subsequent points on this line are estimates.
Figure 5: Detection lag of different IDS models on the Trace dataset. Averaged over older runs, Full-GNN takes $6$ hours to train on and analyze the audit log. At the granularity of the graph, ORCHID appears to have $0$ lag during detection; in practice, it requires $0.002$ seconds to process each event.
...and 2 more figures
