Semantic-Aware Parsing for Security Logs

Julien Piet, Vivian Fang, Rishi Khare, Scott Coull, Vern Paxson, Raluca Ada Popa, David Wagner

TL;DR

Security logs are highly heterogeneous and their formats change over time, which makes manually written parsers costly and brittle. Matryoshka addresses this with an end-to-end pipeline that uses LLMs only at generation time to produce deterministic, semantically-aware parsers in three stages (syntax parsing, semantic naming, and taxonomy mapping), supported by a rigorous validation loop. The resulting parsers are comparable in quality to those written by human experts, offer better queryability, especially with custom semantic schemas, and run at ingestion speeds suitable for large-scale security workloads. The work shows a practical, AI-assisted path to automated, scalable security analytics that reduces manual effort while supporting precise, rapid threat detection and incident reconstruction.

Abstract

Security logs are foundational to threat detection and post-incident investigation, yet analysts often struggle to fully leverage them due to their heterogeneity and unstructured nature. The standard practice of manually writing parsers to normalize the data in security event management systems is time-consuming and costly due to the long tail of log formats. Meanwhile, querying raw logs without explicit parsing using large language models (LLMs) is impractical at scale. In this paper, we introduce Matryoshka, an end-to-end system leveraging LLMs to automatically generate semantically-aware structured log parsers without labeled examples or human intervention. Matryoshka achieves this by directly inferring log syntax, variable naming, and normalization to common security-specific schemas (e.g., OCSF [1]) from unlabeled log line samples, then generating deterministic parsers and mapping rules that can be efficiently applied during data ingest. This approach provides analysts with semantically-rich data representations at scale, facilitating rapid and precise log search without the traditional burden of manual parser construction. We evaluate Matryoshka's capabilities through both established template generation datasets and new datasets curated to establish end-to-end performance on a realistic distribution of log types. Our experiments show that Matryoshka outperforms prior work on syntax parsing while matching human-generated parsers in both side-by-side comparisons and retrieval for security-relevant queries. These results demonstrate that Matryoshka significantly reduces manual effort by automatically extracting and organizing valuable security data, moving us closer to fully automated, AI-driven analytics.
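
As a rough illustration of the three artifacts the abstract describes (inferred syntax, variable names, and a mapping to a common schema), the sketch below hand-writes what a deterministic parser and mapping rule of this kind might look like for one log format. The sample line, the regular expression, the field names, and the target attribute paths are all hypothetical; they are not output of Matryoshka and should not be read as a verified OCSF mapping.

```python
import re

# Hypothetical sample line; not taken from the paper's datasets.
LINE = "Failed password for alice from 203.0.113.7 port 52144 ssh2"

# Stage 1 (syntax): a template with constants and positional variables.
TEMPLATE = re.compile(r"Failed password for (\S+) from (\S+) port (\d+) ssh2")

# Stage 2 (semantic naming): a descriptive name for each variable, in order.
FIELD_NAMES = ["username", "source_ip", "source_port"]

# Stage 3 (taxonomy mapping): illustrative OCSF-style target paths (assumed).
TAXONOMY_MAP = {
    "username": "user.name",
    "source_ip": "src_endpoint.ip",
    "source_port": "src_endpoint.port",
}


def parse(line):
    """Apply the template, name the variables, then rename to taxonomy paths."""
    match = TEMPLATE.fullmatch(line)
    if match is None:
        return None  # the line does not belong to this template
    named = dict(zip(FIELD_NAMES, match.groups()))
    return {TAXONOMY_MAP[name]: value for name, value in named.items()}


record = parse(LINE)
```

Nothing in `parse` calls a model: once the template and mapping exist, applying them is plain regex matching and dictionary renaming, which is what makes this style of parser cheap to run at ingest time.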

Paper Structure

This paper contains 33 sections, 3 figures, 15 tables.

Figures (3)

  • Figure 1: Parsers convert unstructured log lines (top) to structured data suitable for ingestion into a database (upper-right), in a sequence of three steps. The output format makes it easy to query logs for events that satisfy certain conditions (lower-right). Matryoshka creates such parsers by learning the log's syntax ①, crafting a schema ②, and mapping to a taxonomy ③.
  • Figure 2: Example parsing tree, with two templates that share a common prefix. Each node/token represents either a string constant (black) or a variable (with associated regular expression; blue), and each leaf corresponds to a template (given by the path from the root to that leaf).
  • Figure 3: Matryoshka creates parsers in three stages, each subdivided into two logical steps. Matryoshka generates candidate solutions for log lines sequentially, then it validates batches of lines to enforce consistency. Validation produces code to edit the parse tree, iterating until valid.
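
The parsing tree described in the Figure 2 caption can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Node` class, the whitespace tokenization, the example templates, and the template identifiers `T1` and `T2` are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # A constant node sets `text`; a variable node sets `name` and `regex`.
    text: Optional[str] = None
    name: Optional[str] = None
    regex: Optional[str] = None
    children: list = field(default_factory=list)
    template_id: Optional[str] = None  # set only on leaves

    def match(self, token):
        if self.text is not None:
            return token == self.text
        return re.fullmatch(self.regex, token) is not None


def parse(roots, tokens):
    """Walk the tree one token at a time; return (template_id, fields) or None."""
    def walk(nodes, i, fields):
        for node in nodes:
            if i >= len(tokens) or not node.match(tokens[i]):
                continue
            found = fields if node.name is None else {**fields, node.name: tokens[i]}
            if i == len(tokens) - 1 and node.template_id is not None:
                return node.template_id, found
            result = walk(node.children, i + 1, found)
            if result is not None:
                return result
        return None

    return walk(roots, 0, {})


# Two hypothetical templates sharing the prefix "Connection from <ip>":
#   T1: Connection from <ip> closed
#   T2: Connection from <ip> port <port>
closed = Node(text="closed", template_id="T1")
port_value = Node(name="port", regex=r"\d+", template_id="T2")
port_keyword = Node(text="port", children=[port_value])
ip = Node(name="ip", regex=r"\d{1,3}(\.\d{1,3}){3}", children=[closed, port_keyword])
tree = [Node(text="Connection", children=[Node(text="from", children=[ip])])]

r1 = parse(tree, "Connection from 10.0.0.5 closed".split())
r2 = parse(tree, "Connection from 10.0.0.5 port 22".split())
```

Because both templates share the first three nodes, the shared prefix is matched once and the tree branches only where the lines differ; the leaf reached identifies the template, and the variable nodes on the path supply the extracted fields.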