Semantic-Aware Parsing for Security Logs
Julien Piet, Vivian Fang, Rishi Khare, Scott Coull, Vern Paxson, Raluca Ada Popa, David Wagner
TL;DR
Security logs are highly heterogeneous and their formats change over time, making manual parser authoring costly and brittle. Matryoshka addresses this with an end-to-end pipeline that uses LLMs only to generate parsers, not to process logs at ingestion. The pipeline produces deterministic, semantically-aware parsers in three stages (syntax parsing, semantic naming, and taxonomy mapping), supported by a rigorous validation loop. It achieves end-to-end parser quality comparable to parsers written by human experts, while delivering superior queryability, especially when using custom semantic schemas, and operates at ingestion speeds suitable for large-scale security workloads. The work demonstrates a practical AI-assisted path to automated, scalable security analytics, reducing manual effort while enabling precise, rapid threat detection and incident reconstruction.
Abstract
Security logs are foundational to threat detection and post-incident investigation, yet analysts often struggle to fully leverage them due to their heterogeneity and unstructured nature. The standard practice of manually writing parsers to normalize the data in security event management systems is time-consuming and costly due to the long tail of log formats. Meanwhile, querying raw logs without explicit parsing using large language models (LLMs) is impractical at scale. In this paper, we introduce Matryoshka, an end-to-end system leveraging LLMs to automatically generate semantically-aware structured log parsers without labeled examples or human intervention. Matryoshka achieves this by directly inferring log syntax, variable naming, and normalization to common security-specific schemas (e.g., OCSF [1]) from unlabeled log line samples, then generating deterministic parsers and mapping rules that can be efficiently applied during data ingest. This approach provides analysts with semantically-rich data representations at scale, facilitating rapid and precise log search without the traditional burden of manual parser construction. We evaluate Matryoshka's capabilities through both established template generation datasets and new datasets curated to establish end-to-end performance on a realistic distribution of log types. Our experiments show that Matryoshka outperforms prior work on syntax parsing while matching human-generated parsers in both side-by-side comparisons and retrieval for security-relevant queries. These results demonstrate that Matryoshka significantly reduces manual effort by automatically extracting and organizing valuable security data, moving us closer to fully automated, AI-driven analytics.
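To make the idea of a deterministic parser plus mapping rules concrete, the following is a minimal sketch of what the three generated artifacts might look like for a single log format. It is not Matryoshka's actual output or code: the sshd-style log line, the regex template, the variable names, and the `parse_line` helper are all hypothetical, and the OCSF-style attribute paths are illustrative only. What it aims to show is that once these artifacts exist, applying them at ingestion needs no LLM call.

```python
import re

# Stage 1 (syntax): a template for one hypothetical log format,
# written as a regex whose capture groups are the variable parts.
TEMPLATE = re.compile(
    r"^Failed password for (\S+) from (\d{1,3}(?:\.\d{1,3}){3}) port (\d+) ssh2$"
)

# Stage 2 (semantic naming): a descriptive name for each captured variable,
# in capture-group order. These names are assumptions for illustration.
FIELD_NAMES = ["username", "source_ip", "source_port"]

# Stage 3 (taxonomy mapping): variable name -> OCSF-style attribute path.
# The paths are illustrative and not checked against a specific OCSF version.
SCHEMA_MAP = {
    "username": "user.name",
    "source_ip": "src_endpoint.ip",
    "source_port": "src_endpoint.port",
}


def parse_line(line):
    """Apply the template and mapping rules to one log line.

    Returns a dict keyed by schema attribute path, or None if the
    line does not match this template.
    """
    match = TEMPLATE.match(line)
    if match is None:
        return None
    named = dict(zip(FIELD_NAMES, match.groups()))
    return {SCHEMA_MAP[name]: value for name, value in named.items()}


record = parse_line("Failed password for root from 203.0.113.7 port 52144 ssh2")
print(record)
```

Because the parser is a plain regex and two lookup tables, the same input always yields the same record, and a query such as "all events where `src_endpoint.ip` equals a given address" can run over structured fields rather than raw text.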
