Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification

Sizhe Huang; Shujie Yang

TL;DR

This work proposes a protocol-native paradigm that treats protocol-defined field semantics as architectural priors, reformulating the task to align with the data's intrinsic tabular modality rather than incrementally adapting sequence-based architectures, and introduces FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units.

Abstract

Self-supervised masked modeling shows promise for encrypted traffic classification by masking and reconstructing raw bytes. Yet recent work reveals that these methods fail to reduce reliance on labeled data despite costly pretraining: under frozen-encoder evaluation, accuracy drops from above 0.9 to below 0.47. We argue that the root cause is an inductive bias mismatch: flattening traffic into byte sequences destroys protocol-defined semantics. We identify three specific issues: 1) field unpredictability: random fields such as ip.id are unlearnable yet treated as reconstruction targets; 2) embedding confusion: semantically distinct fields collapse into a unified embedding space; 3) metadata loss: capture-time metadata essential for temporal analysis is discarded. To address this, we propose a protocol-native paradigm that treats protocol-defined field semantics as architectural priors, reformulating the task to align with the data's intrinsic tabular modality rather than incrementally adapting sequence-based architectures. Instantiating this paradigm, we introduce FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs). It features predictability-guided filtering that focuses on learnable FSUs, FSU-specific embeddings that preserve field boundaries, and dual-axis attention that captures intra-packet and temporal patterns. FlowSem-MAE significantly outperforms state-of-the-art methods across datasets. With only half of the labeled data, it outperforms most existing methods trained on the full data.
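The embedding-confusion issue can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the field names `ip.len` and `tcp.window_size`, the table size, and the modulo bucketing of values are all assumptions made for illustration. It only shows why a single shared lookup table cannot distinguish equal values in different fields, while one table per field can.

```python
import random


def make_table(vocab_size, dim, seed):
    """Build a random lookup table (a stand-in for learned embedding weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


class UnifiedEmbedding:
    """One table shared by every field: the field name is ignored."""

    def __init__(self, vocab_size, dim):
        self.table = make_table(vocab_size, dim, seed=0)

    def embed(self, field, value):
        # Assumption: values are bucketed by modulo to fit the table.
        return self.table[value % len(self.table)]


class FieldSpecificEmbedding:
    """One table per field, so the same value maps to different vectors."""

    def __init__(self, fields, vocab_size, dim):
        self.tables = {
            name: make_table(vocab_size, dim, seed=index + 1)
            for index, name in enumerate(fields)
        }

    def embed(self, field, value):
        table = self.tables[field]
        return table[value % len(table)]


# Hypothetical field names for the two header fields being compared.
fields = ["ip.len", "tcp.window_size"]
unified = UnifiedEmbedding(vocab_size=2048, dim=4)
per_field = FieldSpecificEmbedding(fields, vocab_size=2048, dim=4)

same_under_unified = (
    unified.embed("ip.len", 1500) == unified.embed("tcp.window_size", 1500)
)
same_under_per_field = (
    per_field.embed("ip.len", 1500) == per_field.embed("tcp.window_size", 1500)
)
print(same_under_unified, same_under_per_field)
```

In a trained model the tables would be learned parameters rather than random values; the point of the sketch is only that the lookup key must include the field identity for the two 1500s to be separable.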

Paper Structure (23 sections, 6 equations, 7 figures, 4 tables)

Introduction
Motivation: Limited Transferability
Key Insight: Protocol-Native Modeling
Related Work
Statistical and Expert-Based Approaches
Masked Language Modeling for Traffic
Masked Vision Modeling for Traffic
Rethinking Traffic Representation Learning
Method
Framework Overview
FSU Extraction and Preprocessing
Predictability-Guided Filtering
FSU-Specific Embeddings
Dual-Axis Transformer Architecture
Experiments
...and 8 more sections

Figures (7)

Figure 1: Protocol fields (left) are flattened into raw bytes (middle) and embedded (right), illustrating inductive bias mismatch at three levels. (P1) Field-level unpredictability: random fields (pink) are treated as learnable despite being unpredictable by protocol design (e.g., ip.id and checksum). (P2) Cross-field-level embedding confusion: field distinctions are lost through cross-field embedding (grey), where adjacent bytes span multiple fields (e.g., ip.flags and ip.frag_offset), and through a unified embedding function, where semantically different values receive identical vectors (e.g., Total Len=1500 and Win Size=1500). (P3) Flow-level metadata loss: temporal metadata (hatched), essential for flow-level behavior analysis, exists outside packet bytes and is entirely discarded.
Figure 2: Workflow of FlowSem-MAE. Noisy FSUs refer to the union of random and non-generalizable fields.
Figure 3: Model size vs. performance (Macro-F1). FlowSem-MAE achieves the best performance with a model size of only 50.25M, significantly outperforming larger models.
Figure 4: Effect of predictability-guided filtering on reconstruction loss. Without predictability-guided filtering, random fields (red) exhibit extremely high loss ($\sim10^9$) and degrade learning of generalizable fields (green).
Figure 5: Performance under different labeled data ratios.
...and 2 more figures
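The effect described in the Figure 4 caption can be sketched with a toy masked-reconstruction loss. This is an illustrative assumption, not the paper's loss: the function name, the mean-squared-error form, the field names, and all numeric values are hypothetical. It shows how a random field such as ip.id can dominate the loss unless reconstruction targets are restricted to learnable fields.

```python
def reconstruction_loss(target, predicted, masked_cells, learnable_fields=None):
    """Mean squared error over masked cells.

    target, predicted: list of packets, each a dict mapping field -> number.
    masked_cells: set of (row_index, field) pairs that were masked.
    learnable_fields: if given, only these fields count as targets.
    """
    errors = []
    for row, field in sorted(masked_cells):
        if learnable_fields is not None and field not in learnable_fields:
            continue  # skip fields filtered out as unpredictable
        diff = target[row][field] - predicted[row][field]
        errors.append(diff * diff)
    return sum(errors) / len(errors) if errors else 0.0


# Toy two-packet flow; every value here is made up for illustration.
target = [
    {"ip.id": 54321, "ip.ttl": 64, "tcp.window_size": 1500},
    {"ip.id": 1207, "ip.ttl": 64, "tcp.window_size": 1460},
]
predicted = [
    {"ip.id": 30000, "ip.ttl": 64, "tcp.window_size": 1490},
    {"ip.id": 30000, "ip.ttl": 63, "tcp.window_size": 1460},
]
masked = {(0, "ip.id"), (0, "tcp.window_size"), (1, "ip.id"), (1, "ip.ttl")}

unfiltered = reconstruction_loss(target, predicted, masked)
filtered = reconstruction_loss(
    target, predicted, masked, learnable_fields={"ip.ttl", "tcp.window_size"}
)
print(unfiltered, filtered)
```

In the unfiltered case the ip.id errors are many orders of magnitude larger than the errors on the other fields, so gradients would be driven mostly by a target that cannot be predicted. How the set of learnable fields is actually chosen is not specified in this excerpt and is left as an input here.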
