The Needle is a Thread: Finding Planted Paths in Noisy Process Trees

Maya Le, Paweł Prałat, Aaron Smith, François Théberge

TL;DR

The paper introduces the planted path problem in noisy labelled trees and proposes a polynomial-time fuzzy matching algorithm based on bottom-up dynamic programming to find high-scoring partial matchings between two trees. It formalizes a feature-based similarity s(𝒢,𝒟) and demonstrates how the matching results can serve as building blocks for unsupervised template discovery and classifier-enhancement workflows. Through synthetic planted-path models and experiments on the real ACME4 dataset, the authors show that planted-path signals can be recovered or amplified even in noisy, bushy trees, and that matches can be leveraged for downstream tasks such as clustering and threat detection. The work provides practical tools for extracting meaningful sequences in cybersecurity logs and related domains, with implications for signal aggregation and scalable analysis of large, noisy process trees.

Abstract

Motivated by applications in cybersecurity, such as finding meaningful sequences of malware-related events buried inside large amounts of computer log data, we introduce the "planted path" problem and propose an algorithm to find fuzzy matchings between two trees. This algorithm can be used as a "building block" for more complicated workflows. We demonstrate the usefulness of a few such workflows in mining synthetically generated data as well as real-world ACME cybersecurity datasets.
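
To make the idea of bottom-up dynamic programming on a labelled tree concrete, here is a minimal sketch. It is not the paper's algorithm, which matches two trees using feature maps and a weight function. It rests on simplifying assumptions of our own: the query is a single sequence of labels, a match scores 1 only when labels are equal, and each skipped tree node costs a constant `skip_cost`. The function name `best_path_match_score` and all parameter names are hypothetical.

```python
from typing import Dict, Hashable, List


def best_path_match_score(
    children: Dict[Hashable, List[Hashable]],
    labels: Dict[Hashable, str],
    root: Hashable,
    query: List[str],
    skip_cost: float = 0.5,
) -> float:
    """Best score of a partial, order-preserving match of `query`
    against some downward path of the tree (illustrative scoring only)."""
    n_query = len(query)

    # Visit parents before children, then reverse to process bottom-up.
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # table[v][i] = best score for matching query[i:] on a path starting at v.
    table: Dict[Hashable, List[float]] = {}
    for node in reversed(order):
        kids = children.get(node, [])
        row = [0.0] * (n_query + 1)  # row[n_query] = 0: nothing left to match
        for i in range(n_query - 1, -1, -1):
            # Option 1: stop here, or leave query[i] unmatched.
            best = max(0.0, row[i + 1])
            # Option 2: match query[i] to this node, continue in a child.
            if query[i] == labels[node]:
                below = max((table[c][i + 1] for c in kids), default=0.0)
                best = max(best, 1.0 + below)
            # Option 3: skip this tree node (assumed noise), pay skip_cost.
            below = max((table[c][i] for c in kids), default=0.0)
            best = max(best, below - skip_cost)
            row[i] = best
        table[node] = row

    # The matched path may start at any node of the tree.
    return max(row[0] for row in table.values())


# Toy tree: the path a -> x -> b -> c hides the query a, b, c behind one noisy node.
toy_children = {0: [1, 2], 1: [3], 3: [4], 2: [5]}
toy_labels = {0: "a", 1: "x", 2: "b", 3: "b", 4: "c", 5: "z"}
print(best_path_match_score(toy_children, toy_labels, 0, ["a", "b", "c"]))  # 2.5
```

In the toy tree, the best match uses the nodes labelled a, b and c along one branch and skips the noisy node x, giving 3 matches minus one skip penalty. The table has one row per tree node and one column per query position, so this sketch runs in time proportional to the number of tree nodes times the query length.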

Paper Structure (18 sections, 3 theorems, 7 equations, 9 figures, 4 algorithms)

Key Result

Theorem 1

Fix $\mathcal{G},\mathcal{H}, \phi_{\mathcal{G}},\phi_{\mathcal{H}},w$. Let $a_{1:L}, A$ be the output of Algorithm \ref{alg:basic_match} with this input. Then $a_{1:L}$ is a valid matching, and the score of $a_{1:L}$ satisfies

Figures (9)

  • Figure 1: Example of a matching of two trees.
  • Figure 2: Example of a typical tree (left) and histogram of similarity scores for the two classes (right).
  • Figure 3: Histograms of symbol frequencies.
  • Figure 4: Embedding of trees coloured by the "ground-truth" cluster label (left). Similarity score for the extracted exemplars (right).
  • Figure 5: Two embeddings of six classes of trees, coloured by the "ground-truth" cluster label: weighted (left) and "unweighted" (right).
  • ...and 4 more figures

Theorems & Definitions (9)

  • Definition 1: Trees and Orderings
  • Definition 2: Valid (Partial) Matchings
  • Remark 1: When do we expect to find the "ground truth" path?
  • Theorem 1
  • Lemma 1: Score Correctness
  • Lemma 2: Path Correctness
  • Proof: Proof of Lemma 1 (Score Correctness)
  • Proof: Sketch of the Proof of Lemma 2 (Path Correctness)
  • Remark 2: Complexity of Rejection Sampling