PacketCLIP: Multi-Modal Embedding of Network Traffic and Language for Cybersecurity Reasoning
Ryozo Masukawa, Sanggeon Yun, Sungheon Jeong, Wenjun Huang, Yang Ni, Ian Bryant, Nathaniel D. Bastian, Mohsen Imani
TL;DR
PacketCLIP addresses encrypted traffic classification by aligning packet data with natural language semantics in a shared multimodal embedding learned through contrastive pretraining. It combines mission-specific knowledge graphs, natural-language explanations, and hierarchical GNN reasoning to enable robust, interpretable intrusion detection on resource-constrained devices. The approach improves AUC by 11.6% over baselines and maintains about 95% mAUC with only 30% of the training data, while cutting parameter count by 92% and FLOPs by 98%. These results indicate resilience to data scarcity, suitability for real-time use in IoT settings, and a foundation for scalable, interpretable security reasoning over encrypted network traffic.
Abstract
Traffic classification is vital for cybersecurity, yet encrypted traffic poses significant challenges. We present PacketCLIP, a multi-modal framework combining packet data with natural language semantics through contrastive pretraining and hierarchical Graph Neural Network (GNN) reasoning. PacketCLIP integrates semantic reasoning with efficient classification, enabling robust detection of anomalies in encrypted network flows. By aligning textual descriptions with packet behaviors, it offers enhanced interpretability, scalability, and practical applicability across diverse security scenarios. PacketCLIP achieves a 95% mean AUC, outperforms baselines by 11.6%, and reduces model size by 92%, making it ideal for real-time anomaly detection. By bridging advanced machine learning techniques and practical cybersecurity needs, PacketCLIP provides a foundation for scalable, efficient, and interpretable solutions to tackle encrypted traffic classification and network intrusion detection challenges in resource-constrained environments.
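To make the contrastive pretraining idea concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive objective over paired packet and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact loss, and the function names (`clip_style_loss`, `l2_normalize`, `log_softmax`), the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not the zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_style_loss(packet_emb, text_emb, temperature=0.07):
    # Assumption: packet_emb[i] and text_emb[i] describe the same flow,
    # so the matching pairs lie on the diagonal of the similarity matrix.
    p = [l2_normalize(v) for v in packet_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(p)
    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(p[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Packet-to-text direction: each row should select its own column.
    loss_p2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-packet direction: each column should select its own row.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2p = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_p2t + loss_t2p)

# Toy example: two packet embeddings paired with matching or swapped text embeddings.
packets = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = clip_style_loss(packets, aligned_text)
swapped_loss = clip_style_loss(packets, swapped_text)
```

In this toy setup the loss for correctly paired embeddings is much smaller than for swapped pairs, which is the signal that pulls a packet representation toward its own textual description during pretraining.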
