LESS: Efficient Log Storage System Based on Learned Model and Minimum Attribute Tree

Zhiyang Cheng, Zizhen Zhu, Haoran Dang, Hai Wan, Xibin Zhao

TL;DR

LESS is a novel provenance graph storage system that uses less storage space and supports faster storage and queries than current approaches. Compared with the state-of-the-art approach, LEONARD, it reduces storage time by a factor of 6.29 and answers queries 18.3 times faster.

Abstract

In recent years, cyber attacks have become increasingly sophisticated and persistent. Detection and investigation based on the provenance graph can effectively mitigate cyber intrusion. However, over the long time spans that defense requires, the sheer size of the provenance graph poses significant challenges for storage systems. Faced with long-term storage tasks, existing methods cannot simultaneously achieve lossless storage, efficient compression, and fast query support. In this paper, we propose a novel provenance graph storage system, LESS, which consumes less storage space and supports faster storage and queries than current approaches. We partition the provenance graph into two distinct components, the graph structure and the graph attributes, and store them separately. Based on their respective characteristics, we devise two storage schemes: a machine-learning-based method for storing the graph structure, and a minimum spanning tree for storing the graph attributes. Compared with the state-of-the-art approach, LEONARD, on the DARPA TC dataset, LESS reduces storage time by a factor of 6.29 and disk usage by a factor of 5.24, and answers queries 18.3 times faster while using only 11.5% of the memory.

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, 7 tables, 2 algorithms.

Figures (7)

  • Figure 1: A Provenance Graph Example.
  • Figure 2: Overview of the storage process of LESS: We input the provenance graph in Figure 1 to LESS. First, we split the provenance graph into graph structure and graph attributes, and then store them separately. In the structure storage, we process the graph structure into vectors (A.1) for training an XGBoost model (A.2). Based on the outputs of XGBoost, a calibration table is generated (A.3). In the attribute storage, we first process the node attributes into vectors using the Bag-of-words method, and then calculate the Manhattan distance between pairs of vectors to obtain a similarity matrix with a window size of 3 (B.1). For example, the similarity matrix element 2 represents $d(v_1,v_2)=|4-3|+|1-2|+|1-1|=2$. The similarity matrix can be viewed as a representation of an undirected graph, where setting the maximum distance to 3 yields the minimum spanning tree (B.2.1), representing the final attribute tree structure. Finally, the corresponding edit operations are generated based on the tree structure to obtain the minimum attribute tree (B.2.2). The trained model, calibration table, and minimum attribute trees are the final outputs. (To simplify the illustration, the node attributes in Figure 1 are simplified. Since the storage process for edges is the same as that for nodes, we do not repeat the edge storage process here.)
  • Figure 3: Locality on different logs.
  • Figure 4: Overview of the querying process of LESS: We input (n3, 3) to query detailed information about node n3 and its subsequent 2 nodes, along with the related edges. In a cold query, LESS first restores the graph structure, obtains the IDs of the targeted nodes and edges in the graph structure, then passes the IDs to the minimum attribute tree and retrieves detailed information about all of those nodes and edges. (For simplicity, the edge querying process is not illustrated here; it follows the same procedure as the node querying process above.)
  • Figure 5: Performance of LESS with Bag-of-words and Manhattan distance, and with edit distance.
  • ...and 2 more figures
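
The attribute storage steps described in the Figure 2 caption (Bag-of-words vectors, windowed Manhattan distances, minimum spanning tree) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "window size of 3" means each node is compared only with the next 3 nodes in sequence, and that "maximum distance 3" means pairs farther apart than 3 are not linked; both readings are assumptions. All function names are hypothetical, and only the vectors v1 = (4, 1, 1) and v2 = (3, 2, 1) come from the caption's worked example; the other two vectors are made up for illustration.

```python
from collections import Counter


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in a whitespace-split string."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def windowed_edges(vectors, window=3, max_dist=3):
    """Candidate edges (dist, i, j) between nodes at most `window` positions
    apart whose distance does not exceed `max_dist` (assumed interpretation)."""
    edges = []
    for i in range(len(vectors)):
        for j in range(i + 1, min(i + window + 1, len(vectors))):
            d = manhattan(vectors[i], vectors[j])
            if d <= max_dist:
                edges.append((d, i, j))
    return edges


def minimum_spanning_forest(n, edges):
    """Kruskal's algorithm with union-find. Returns tree edges as (i, j, dist).
    Nodes with no admissible edge stay isolated and form their own tree."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for d, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, d))
    return tree


# v1 and v2 are the caption's example vectors; v3 and v4 are hypothetical.
vectors = [[4, 1, 1], [3, 2, 1], [3, 2, 2], [0, 0, 5]]
edges = windowed_edges(vectors, window=3, max_dist=3)
tree = minimum_spanning_forest(len(vectors), edges)
# tree == [(1, 2, 1), (0, 1, 2)]; node 3 is too far from the others and stays isolated
```

In this sketch each tree edge links a node to a similar neighbor, so a node's attributes could be stored as a small set of edit operations relative to its parent rather than in full, which is the role the caption assigns to the minimum attribute tree.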