LogPrécis: Unleashing Language Models for Automated Malicious Log Analysis

Matteo Boffa, Rodolfo Vieira Valentim, Luca Vassio, Danilo Giordano, Idilio Drago, Marco Mellia, Zied Ben Houidi

TL;DR

This paper systematically studies how to use state-of-the-art LMs to automatically analyze text-like Unix shell attack logs, paving the way for better and more responsive defense against cyberattacks.

Abstract

The collection of security-related logs holds the key to understanding attack behaviors and diagnosing vulnerabilities. Still, their analysis remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question arises whether and how LMs could also be useful for security experts, since security logs contain intrinsically confused and obfuscated information. In this paper, we systematically study how to benefit from state-of-the-art LMs to automatically analyze text-like Unix shell attack logs. We present a thorough design methodology that leads to LogPrécis. It receives raw shell sessions as input and automatically identifies and assigns the attacker tactic to each portion of the session, i.e., it unveils the sequence of the attacker's goals. We demonstrate LogPrécis's capability to support the analysis of two large datasets containing about 400,000 unique Unix shell attacks. LogPrécis reduces them to about 3,000 fingerprints, each grouping sessions with the same sequence of tactics. The abstraction it provides lets the analyst better understand attacks, identify fingerprints, detect novelty, link similar attacks, and track families and mutations. Overall, LogPrécis, released as open source, paves the way for better and more responsive defense against cyberattacks.
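
The fingerprinting idea in the abstract (grouping sessions that share the same sequence of tactics) can be illustrated with a minimal Python sketch. This is not the LogPrécis implementation: the function names `fingerprint` and `group_by_fingerprint`, the session identifiers, and the per-statement tactic labels are all hypothetical, and collapsing consecutive repeated tactics is an assumption made here for illustration. In practice the labels would come from the language model's predictions.

```python
from collections import defaultdict
from itertools import groupby


def fingerprint(tactics):
    """Return the sequence of tactics as a hashable tuple.

    Assumption (not from the paper): consecutive repeats of the same
    tactic are collapsed into a single entry.
    """
    return tuple(tactic for tactic, _ in groupby(tactics))


def group_by_fingerprint(sessions):
    """Map each fingerprint to the list of session ids that share it."""
    groups = defaultdict(list)
    for session_id, tactics in sessions.items():
        groups[fingerprint(tactics)].append(session_id)
    return dict(groups)


# Hypothetical per-statement tactic labels for three sessions.
sessions = {
    "s1": ["Discovery", "Discovery", "Execution"],
    "s2": ["Discovery", "Execution", "Execution"],
    "s3": ["Persistence", "Execution"],
}

groups = group_by_fingerprint(sessions)
for fp, ids in groups.items():
    print(" -> ".join(fp), ids)
```

Under these assumptions, sessions `s1` and `s2` fall into the same group even though their raw statements differ, which is the kind of abstraction that lets an analyst review a few fingerprints instead of every session.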

Paper Structure (31 sections, 18 figures, 5 tables)

Figures (18)

  • Figure 1: Example of a session, definition of statements, words, tokens, and their classification into MITRE tactics.
  • Figure 2: Choices for adopting PLMs in a security pipeline. As alternatives, we also test GPT-3 and classic approaches after fixing the best combination for other choices.
  • Figure 3: ROUGE-1 vs. Fidelity for different chunking strategies (HaaS dataset). 18 points per strategy. Each metric is averaged over 5 seeds. Context chunking is the winning strategy.
  • Figure 4: Benefit of domain adaptation (HaaS dataset). Arrows link the same model and task without and with it. Domain adaptation improves the performance 8 times out of 9.
  • Figure 5: Classification metrics for the best model (HaaS dataset). Error bars report the variance among the 5 different splits.
  • ...and 13 more figures