
SPARSE: Semantic Tracking and Path Analysis for Attack Investigation in Real-time

Jie Ying, Tiantian Zhu, Wenrui Cheng, Qixuan Yuan, Mingjun Ma, Chunlin Xiong, Tieming Chen, Mingqi Lv, Yan Chen

TL;DR

SPARSE is a real-time attack investigation system that addresses the scalability limits of traditional provenance graphs. It first constructs a Suspicious Semantic Graph (SSG) from streaming logs, then performs Path-level Contextual Analysis (PCA) to extract a concise Critical Component Graph (CCG) of attack-related events. A state-based Suspicious Semantic Transfer mechanism, backed by an in-memory Suspicious Entity List (SEL) and an on-disk Relevant Event Table (RET), tracks suspicious semantics efficiently; edge compaction and BFS-based Suspicious Flow Path extraction follow. Path-level scoring combines data-flow and timing features via EventScore, Impact, and PathScore, using cosine similarity and an inflation factor to filter irrelevant paths against a threshold $T$, producing a high-fidelity CCG. Evaluations on a large-scale dataset show a large reduction in graph size (~113 CCG edges vs. ~227k backtracking edges) and strong filtering performance (FP around 99 with FN = 0), with real-time processing (about 2 s per investigation) and modest memory (~30 MB) and disk (~21 MB) usage. Overall, SPARSE delivers low-latency, low-overhead, high-precision attack investigation that can augment existing intrusion detection and forensics workflows. All results are framed with formal path-based metrics and real-time streaming guarantees, making SPARSE practical for enterprise deployments.

Abstract

As the complexity and destructiveness of Advanced Persistent Threats (APTs) increase, there is a growing need to identify the series of actions an attacker undertakes to achieve their goal, a task called attack investigation. Currently, analysts construct a provenance graph and perform causality analysis on a Point-Of-Interest (POI) event to capture critical events (those related to the attack). However, due to the vast size of the provenance graph and the rarity of critical events, existing attack investigation methods suffer from high false positives, high overhead, and high latency. To this end, we propose SPARSE, an efficient, real-time system for constructing critical component graphs (i.e., graphs consisting of critical events) from streaming logs. Our key observations are that 1) critical events exist in a suspicious semantic graph (SSG) composed of interaction flows between suspicious entities, and 2) information flows that accomplish the attacker's goal exist in the form of paths. Therefore, SPARSE uses a two-stage framework for attack investigation: constructing the SSG, then performing path-level contextual analysis. First, SPARSE operates in a state-based mode in which events are consumed as streams, allowing easy access to the SSG related to the POI event through its semantic transfer rule and storage strategy. Then, SPARSE identifies all suspicious flow paths (SFPs) related to the POI event in the SSG and quantifies the influence of each path to filter out irrelevant events. Our evaluation on a real large-scale attack dataset shows that SPARSE can generate a critical component graph (~113 edges) in 1.6 seconds, which is 2014× smaller than the backtracking graph (~227,589 edges). SPARSE is 25× more effective than other state-of-the-art techniques at filtering irrelevant edges.
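
The two stages described above can be illustrated with a small sketch. The abstract does not spell out the semantic transfer rule, the storage layout, or the scoring formulas, so everything below is an assumption made for illustration: the `Event` fields, the `SuspiciousTracker` class, the rule "an event is kept when its source entity is already suspicious, and its destination then becomes suspicious", and the timestamp check during path extraction are all hypothetical. A plain Python set and list stand in for the SEL and the RET, and path scoring is omitted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    src: str  # entity the information flows from
    dst: str  # entity the information flows to
    op: str   # operation label, e.g. "fork" or "write"
    ts: int   # timestamp of the event


class SuspiciousTracker:
    """Hypothetical stand-in for state-based suspicious semantic tracking."""

    def __init__(self, seeds):
        self.sel = set(seeds)  # stand-in for the in-memory Suspicious Entity List
        self.ret = []          # stand-in for the event table (kept in memory here)

    def consume(self, event):
        """Process one streamed event; return True if it was recorded."""
        # Assumed transfer rule: suspicion moves from source to destination.
        if event.src in self.sel:
            self.sel.add(event.dst)
            self.ret.append(event)
            return True
        return False

    def paths_to(self, poi_entity):
        """Backward BFS over recorded events; returns flow paths ending at the POI."""
        incoming = {}
        for e in self.ret:
            incoming.setdefault(e.dst, []).append(e)

        paths = []
        queue = deque([(poi_entity, [])])
        while queue:
            node, suffix = queue.popleft()
            on_path = {node} | {e.dst for e in suffix}
            preds = [
                e for e in incoming.get(node, [])
                if e.src not in on_path                     # keep paths simple (no cycles)
                and (not suffix or e.ts <= suffix[0].ts)    # assumed time ordering
            ]
            if not preds:
                if suffix:
                    paths.append(suffix)
                continue
            for e in preds:
                queue.append((e.src, [e] + suffix))
        return paths


# Toy stream: one benign event (cron) mixed into a suspicious chain.
stream = [
    Event("attacker_ip", "nginx", "recv", 1),
    Event("nginx", "bash", "fork", 2),
    Event("cron", "logrotate", "fork", 3),
    Event("bash", "tar", "fork", 4),
    Event("tar", "out.tgz", "write", 5),
    Event("bash", "out.tgz", "write", 6),
]

tracker = SuspiciousTracker(seeds={"attacker_ip"})
for ev in stream:
    tracker.consume(ev)

paths = tracker.paths_to("out.tgz")
for p in paths:
    print(" -> ".join([p[0].src] + [e.dst for e in p]))
```

In this toy stream the `cron` event is never recorded because its source is not suspicious, which is the sense in which the tracked graph stays far smaller than a full backtracking graph. A real system would then score each extracted path and drop low-influence ones; that step is left out here because its formulas are not given in this section.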

Paper Structure (34 sections, 4 equations, 7 figures, 7 tables, 2 algorithms)

Figures (7)

  • Figure 1: Partial dependency graph of one attack case, Dataleak. The black dashed box indicates the backtracking graph ($\sim$ 200,000 edges) constructed from the POI event via backward propagation. The blue dashed box indicates the suspicious semantic graph ($\sim$ 27 edges) constructed by SPARSE. The red dashed box indicates the critical component graph ($\sim$ 22 edges) exported by SPARSE.
  • Figure 2: Architecture of SPARSE.
  • Figure 3: An example of Suspicious Semantic Transfer. A red solid line indicates that the entity carries suspicious semantics. SEL is short for Suspicious Entity List and RET is short for Relevant Event Table.
  • Figure 4: Suspicious flow path extraction and path-level contextual scoring.
  • Figure 5: Hyperparameter Matrices for System Performance with Different Parameters.
  • ...and 2 more figures