A Benign Activity Extraction Method for Malignant Activity Identification using Data Provenance
Taishin Saito
TL;DR
This work tackles dependency explosion in Data Provenance-based attack analysis by extracting frequently occurring benign activities from log data using NLP and similarity analysis, forming node-sets and labeling them so that benign activity can be removed from provenance graphs. The method consists of four phases: data preprocessing, node-set construction, node-set labeling, and benign-activity extraction with graph reduction. It is operationalized with the vector representation $V = w_{c} V_{c} + \sum w_{f_{i}} V_{f_{i}} + \sum w_{n_{i}} V_{n_{i}}$, with $w_{f_{i}} = \log \frac{P}{P_{f_{i}}}$, and node-sets comprise five nodes. Evaluations on the DARPA TC dataset (E3 Theia, E5 Theia, and E5 Marple) show that about $6.8\%$ to $39\%$ of activities are benign patterns, and that removing benign activities extracted from as little as $1.4\%$ to $3.2\%$ of the log data can reduce the search space by up to about $52\%$, with no observed false negatives in the labeled datasets. The approach is robust across these datasets and suggests practical benefits for intrusion detection systems and OT environments, since it enables activity-pattern whitelisting and reduces analyst workload. Future work includes extending the method to edge-based classification, broadening data-format compatibility, and optimizing parameters to improve generalization and real-time applicability.
Abstract
To understand the overall picture of a cyber attack and identify its source, methods have been developed that track Data Provenance to automatically build a graph tying together the dependencies among a series of related events, and then use that graph to identify malicious activities. However, these methods suffer from dependency explosion: a large number of normal computer system operations, such as operations by authorized users, are included in the dependencies, so the generated graph becomes huge and malicious activities are difficult to identify. In this paper, we propose a method that reduces the search space for malicious activities by extracting and removing frequently occurring benign activities, using natural language processing of log data and similarity judgments to analyze activities in the computer system. In the evaluation experiment, we used the DARPA TC Dataset, a large-scale public dataset, to evaluate the effectiveness of the proposed method against the dependency explosion problem. We showed that about 6.8% to 39% of the activities in a computer system could be defined as patterns of benign activities. We also showed that removing benign activities extracted from a small portion of the log data (approximately 1.4% to 3.2% in size) can significantly reduce the search space (by up to approximately 52%) in large datasets.
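As an illustration of the weighted vector representation $V = w_{c} V_{c} + \sum w_{f_{i}} V_{f_{i}} + \sum w_{n_{i}} V_{n_{i}}$ with $w_{f_{i}} = \log \frac{P}{P_{f_{i}}}$, the following is a minimal sketch and not the paper's implementation. It assumes that $P$ is a total count and $P_{f_{i}}$ is the count of items containing term $f_{i}$ (an IDF-style reading that this summary does not confirm), that the other weights are supplied by the caller, and that the similarity judgment is a cosine similarity. The function names `idf_weight`, `combine`, and `cosine` are hypothetical.

```python
import math


def idf_weight(total, containing):
    """w = log(P / P_f). Assumed reading: total = P, containing = P_f,
    with 0 < containing <= total."""
    return math.log(total / containing)


def combine(w_c, v_c, weighted_vectors):
    """V = w_c * V_c + sum(w_i * V_i) over (weight, vector) pairs.
    All vectors are plain lists of floats with the same length."""
    out = [w_c * x for x in v_c]
    for w, v in weighted_vectors:
        for k, x in enumerate(v):
            out[k] += w * x
    return out


def cosine(a, b):
    """Cosine similarity (an assumed choice of similarity measure).
    Returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this reading, a term that appears everywhere gets weight $\log 1 = 0$ and drops out of $V$, while rare terms dominate the comparison between two combined vectors.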
