DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation

Jie Ying, Tiantian Zhu, Mingqi Lv, Tieming Chen

TL;DR

The paper tackles the growing storage burden of provenance graphs derived from kernel audit logs and presents Dehydrator, a three-stage system that combines field mapping encoding, hierarchical encoding, and a decoder-only Transformer with an error-correction mechanism to enable efficient storage and batch querying. The system stores data losslessly while still supporting queries. Evaluated on seven large datasets totaling over one billion log entries, it reduces storage space by 84.55% and is substantially more efficient than PostgreSQL, Neo4j, and Leonard. Key contributions include a practical hierarchy-aware encoding of incoming edges, a DNN-based storage framework tailored for batch queries, and an analysis of component impacts, applicable scenarios, and model capacity trade-offs. The approach offers a scalable cold-storage solution for forensic provenance analysis, with potential to evolve toward broader graph-storage use cases through open-sourcing and further optimization.

Abstract

As the scope and impact of cyber threats have expanded, analysts use audit logs to hunt threats and investigate attacks. Provenance graphs constructed from kernel logs are increasingly regarded as an ideal data source because of their strong semantic expressiveness and their ability to correlate historical attack activity. However, storing provenance graphs in traditional databases incurs high storage overhead, given the high frequency of kernel events and the persistence of attacks. To address this, we propose Dehydrator, an efficient provenance graph storage system. For the logs generated by auditing frameworks, Dehydrator uses field mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally trains a deep neural network to support batch querying. We evaluated Dehydrator on seven datasets totaling over one billion log entries. Experimental results show that Dehydrator reduces storage space by 84.55%. Dehydrator is 7.36 times more efficient than PostgreSQL, 7.16 times more efficient than Neo4j, and 16.17 times more efficient than Leonard (the work most closely related to Dehydrator, published at USENIX Security '23).
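
To make the two redundancy-filtering stages concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: this section does not specify Dehydrator's actual encoding formats, so the triple layout `(source, operation, destination)`, the helper names `field_mapping_encode`, `group_incoming_edges`, and `decode`, and the sample log entries are all hypothetical. The sketch treats field mapping as dictionary encoding of repeated field values and approximates the hierarchy-aware step by grouping incoming edges under their destination node. The neural sequence model is not sketched.

```python
from collections import defaultdict


def field_mapping_encode(records):
    """Replace every field value with a small integer ID (dictionary encoding).

    Assumption: each record is a (source, operation, destination) triple.
    """
    mapping = {}  # field value -> integer ID, assigned in first-seen order
    encoded = []
    for record in records:
        row = []
        for value in record:
            if value not in mapping:
                mapping[value] = len(mapping)
            row.append(mapping[value])
        encoded.append(tuple(row))
    return mapping, encoded


def group_incoming_edges(encoded):
    """Group edges by destination so each destination ID is stored only once."""
    grouped = defaultdict(list)
    for src, op, dst in encoded:
        grouped[dst].append((src, op))
    return dict(grouped)


def decode(mapping, grouped):
    """Invert both steps and return the original triples in sorted order."""
    inverse = {idx: value for value, idx in mapping.items()}
    return sorted(
        (inverse[src], inverse[op], inverse[dst])
        for dst, edges in grouped.items()
        for src, op in edges
    )


# Hypothetical audit-log triples, for illustration only.
logs = [
    ("firefox", "write", "/tmp/a.txt"),
    ("firefox", "read", "/etc/hosts"),
    ("bash", "write", "/tmp/a.txt"),
]

mapping, encoded = field_mapping_encode(logs)
grouped = group_incoming_edges(encoded)

# The round trip recovers every triple, so no edge is lost.
assert decode(mapping, grouped) == sorted(logs)
```

This toy version recovers the set of edges but not their original order. A lossless store for real audit logs would also have to preserve timestamps or sequence numbers, which the sketch leaves out.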

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 8 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of Dehydrator.
  • Figure 2: Hierarchical Encoding.
  • Figure 3: Storage Overhead and Time Costs of Individual Components.
  • Figure 4: Impact of Model Capacity.