PacketCLIP: Multi-Modal Embedding of Network Traffic and Language for Cybersecurity Reasoning

Ryozo Masukawa, Sanggeon Yun, Sungheon Jeong, Wenjun Huang, Yang Ni, Ian Bryant, Nathaniel D. Bastian, Mohsen Imani

TL;DR

PacketCLIP tackles encrypted traffic classification by aligning packet data with natural language semantics in a shared multimodal embedding learned through contrastive pretraining. It combines mission-specific knowledge graphs, natural language (NL) explanations, and hierarchical GNN reasoning to enable robust, interpretable intrusion detection on resource-constrained devices. The approach delivers an 11.6% improvement in AUC over baselines and maintains about 95% mAUC with only 30% of the training data, while reducing model size by 92% (fewer parameters) and FLOPs by 98%. These results indicate resilience to data scarcity, real-time applicability in IoT contexts, and a foundation for scalable, interpretable security reasoning over encrypted network traffic.

Abstract

Traffic classification is vital for cybersecurity, yet encrypted traffic poses significant challenges. We present PacketCLIP, a multi-modal framework combining packet data with natural language semantics through contrastive pretraining and hierarchical Graph Neural Network (GNN) reasoning. PacketCLIP integrates semantic reasoning with efficient classification, enabling robust detection of anomalies in encrypted network flows. By aligning textual descriptions with packet behaviors, it offers enhanced interpretability, scalability, and practical applicability across diverse security scenarios. PacketCLIP achieves a 95% mean AUC, outperforms baselines by 11.6%, and reduces model size by 92%, making it ideal for real-time anomaly detection. By bridging advanced machine learning techniques and practical cybersecurity needs, PacketCLIP provides a foundation for scalable, efficient, and interpretable solutions to tackle encrypted traffic classification and network intrusion detection challenges in resource-constrained environments.
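As a rough illustration of what contrastive pretraining over paired packet and text embeddings can look like, the sketch below computes a CLIP-style symmetric cross-entropy loss in plain Python. This excerpt does not state the exact loss the authors use, so the symmetric form, the function name `clip_contrastive_loss`, and the temperature value of 0.07 are all assumptions made for illustration; the embeddings are toy vectors standing in for the outputs of a packet encoder and a text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum_exp for x in row]


def clip_contrastive_loss(packet_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of (packet, text) pairs.

    Row i of packet_embs is assumed to be paired with row i of text_embs.
    The temperature default is an illustrative assumption, not a value
    taken from the paper.
    """
    packets = [l2_normalize(v) for v in packet_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(packets)

    # logits[i][j] = cosine similarity of packet i and text j, scaled.
    logits = [
        [sum(a * b for a, b in zip(packets[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Packet-to-text direction: the matching text for packet i is column i.
    loss_p2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-packet direction: transpose and repeat.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2p = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_p2t + loss_t2p)


if __name__ == "__main__":
    packet_batch = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned :", clip_contrastive_loss(packet_batch, aligned_text))
    print("shuffled:", clip_contrastive_loss(packet_batch, shuffled_text))
```

In this sketch, correctly paired embeddings yield a loss near zero and mismatched pairs yield a large loss, which is the signal that pulls packet representations toward their textual descriptions during pretraining.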

Paper Structure

This paper contains 16 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Semantic AI framework for detecting traffic related to specific cyber-attacks defined by a user (a), combining LLM-driven knowledge graphs (b), PacketCLIP alignment (c), hierarchical reasoning (d), and Tiny AI to enable efficient, interpretable, and traceable detection on low-resource devices (e).
  • Figure 2: A framework to generate NL explanations for intrusion scenarios by mapping tabular security flow data (1) to text templates (2), leveraging LLM-generated knowledge graphs (3), utilizing LLMs for paraphrased explanations (4), and producing interpretable descriptions of network events (5).
  • Figure 3: (a) The overall architecture of the contrastive pretraining process for PacketCLIP, including encoding packets and paired texts for learning. (b) A mission-specific hierarchical GNN framework that integrates PacketCLIP with temporal models and classifiers to derive intrusion detection results.
  • Figure 4: Zero-shot accuracy change during training shows a trade-off: SSL on both encoders improves faster but is less stable, while SSL only on the packet encoder progresses more slowly but is more stable.
  • Figure 5: Word clouds and the 10 most frequent terms for DoS, Brute Force, and Reconnaissance missions from ACI-IoT-2023, highlighting key terms like 'botnet,' 'credential,' and 'port scanning' for the respective categories.
  • ...and 2 more figures