Towards Characterizing Cyber Networks with Large Language Models

Alaric Hartsock, Luiz Manella Pereira, Glenn Fink

TL;DR

This paper uses latent features of cyber data to find anomalies via a prototype tool called the Cyber Log Embeddings Model (CLEM), which was trained on Zeek network traffic logs from both a real-world production network and an Internet of Things (IoT) cybersecurity testbed.

Abstract

Threat hunting analyzes large, noisy, high-dimensional data to find sparse adversarial behavior. We believe adversarial activities, however they are disguised, are extremely difficult to obscure completely in high-dimensional space. In this paper, we employ these latent features of cyber data to find anomalies via a prototype tool called the Cyber Log Embeddings Model (CLEM). CLEM was trained on Zeek network traffic logs from both a real-world production network and an Internet of Things (IoT) cybersecurity testbed. The model is deliberately overtrained on a sliding window of data to characterize each window closely. We use the Adjusted Rand Index (ARI) to compare the k-means clustering of CLEM output to expert labeling of the embeddings. Our approach demonstrates that there is promise in using natural language modeling to understand cyber data.
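
To make the evaluation metric concrete, the following is a minimal sketch of the Adjusted Rand Index, computed from the pair-counting contingency table between two labelings. It is not the authors' implementation: the function name `adjusted_rand_index` and the toy labels (`expert`, `clusters`) are hypothetical and stand in for expert labels and k-means cluster assignments of CLEM embeddings.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items.

    Returns 1.0 for identical partitions (up to renaming of labels) and
    values near 0.0 for agreement no better than chance.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the partitions trivially agree

    # Contingency counts: how many items share each (true, predicted) pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs that are grouped together in both labelings.
    index = sum(comb(c, 2) for c in cell_counts.values())
    # Number of item pairs grouped together in each labeling separately.
    true_pairs = sum(comb(c, 2) for c in true_counts.values())
    pred_pairs = sum(comb(c, 2) for c in pred_counts.values())

    expected = true_pairs * pred_pairs / total_pairs
    max_index = (true_pairs + pred_pairs) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have one group
    return (index - expected) / (max_index - expected)


# Hypothetical toy data: expert labels vs. cluster ids for six embeddings.
expert = ["dns", "dns", "web", "web", "ssh", "ssh"]
clusters = [2, 2, 0, 0, 1, 1]
score = adjusted_rand_index(expert, clusters)
```

Because the index depends only on which items are grouped together, the cluster ids do not need to match the expert label names. Sweeping the number of k-means clusters and recomputing the score for each setting would give a curve of ARI against cluster count.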

Paper Structure

This paper contains 9 sections, 3 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Dimensionally reduced connection and address embeddings for the PNNL and ACI data. Colors are assigned to groups that appear in the data, but there are many more categories than distinguishable colors, so the coloring is meant only to show how homogeneous the clusters are.
  • Figure 2: Calculated Adjusted Rand Index vs Number of Clusters