Reconstructing Fine-Grained Network Data using Autoencoder Architectures with Domain Knowledge Penalties

Mark Cheung, Sridhar Venkatesan

TL;DR

This work reconstructs fine-grained network session details from coarse feature representations to enable privacy-preserving, storage-efficient threat analytics. It introduces autoencoder architectures augmented with domain knowledge penalties via a Knowledge-Augmented Loss (KAL) and a Constraint Enforcement Module to ensure protocol-consistent reconstructions of PCAP headers. Transformer-based autoencoders with a combined cross-entropy ($CE$) and mean-squared-error ($MSE$) loss achieve superior reconstruction, especially for categorical features, and session-level encoding provides robustness under missing data. The approach lowers data storage needs while maintaining fidelity for security-relevant signals, with practical implications for data-efficient training and improved threat detection in network security pipelines.

Abstract

The ability to reconstruct fine-grained network session data, including individual packets, from coarse-grained feature vectors is crucial for improving network security models. However, the large-scale collection and storage of raw network traffic pose significant challenges, particularly for capturing rare cyberattack samples. These challenges hinder the ability to retain comprehensive datasets for model training and future threat detection. To address this, we propose a machine learning approach guided by formal methods to encode and reconstruct network data. Our method employs autoencoder models with domain-informed penalties to impute PCAP session headers from structured feature representations. Experimental results demonstrate that incorporating domain knowledge through constraint-based loss terms significantly improves reconstruction accuracy, particularly for categorical features with session-level encodings. By enabling efficient reconstruction of detailed network sessions, our approach facilitates data-efficient model training while preserving privacy and storage efficiency.
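
The idea of pairing a reconstruction loss with a domain-informed penalty can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's Knowledge-Augmented Loss: the function names, the penalty weight, and the specific rule (a packet predicted as UDP should carry no TCP flags) are all hypothetical and chosen only because the rule is a well-known protocol fact.

```python
import math

def cross_entropy(probs, target_index):
    """CE for one categorical field: negative log-probability of the true class."""
    return -math.log(max(probs[target_index], 1e-12))

def mean_squared_error(pred, target):
    """MSE over the numeric header fields of one packet."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def udp_flag_penalty(protocol_probs, tcp_flag_probs, udp_index):
    """Hypothetical soft constraint: UDP packets have no TCP flags.

    The penalty grows when the model assigns high probability to UDP
    while also predicting that TCP flags are set.
    """
    return protocol_probs[udp_index] * sum(tcp_flag_probs)

def combined_loss(protocol_probs, true_protocol, numeric_pred, numeric_true,
                  tcp_flag_probs, udp_index=1, penalty_weight=0.5):
    """CE + MSE reconstruction loss plus a weighted domain penalty (assumed form)."""
    ce = cross_entropy(protocol_probs, true_protocol)
    mse = mean_squared_error(numeric_pred, numeric_true)
    penalty = udp_flag_penalty(protocol_probs, tcp_flag_probs, udp_index)
    return ce + mse + penalty_weight * penalty

# Two reconstructions of the same UDP packet (class index 1 = UDP, assumed).
consistent = combined_loss([0.1, 0.9], 1, [0.5], [0.5], [0.0, 0.0])
violating = combined_loss([0.1, 0.9], 1, [0.5], [0.5], [1.0, 1.0])
```

Both reconstructions have identical CE and MSE terms, so the protocol-inconsistent one is ranked worse only because of the penalty term. That is the general effect a constraint-based loss term is meant to have during training.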

Paper Structure

This paper contains 16 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the proposed approach to encode and reconstruct fine-grained network sessions. Training is augmented with loss functions based on domain constraints, and at deployment the outputs of the model are modified to comply with the domain specification.
  • Figure 2: End-to-end architecture of the proposed framework
  • Figure 3: Model Comparison
  • Figure 4: Comparison between loss functions. CE: Cross-Entropy loss, MSE: Mean-Squared Error loss.
  • Figure 5: Reconstruction loss comparison between session-level and packet-level encoding with varying dropout rates.