Direct Feature Access -- Scaling Network Traffic Feature Collection to Terabit Speed

Lukas Froschauer, Jonatan Langlet, Andreas Kassler

TL;DR

The paper addresses real-time, fine-grained network telemetry at terabit speeds for ML-based analysis. It introduces Direct Feature Access (DFA), a system that extracts flow features in P4-programmable switches and streams the resulting feature vectors into GPU memory via RDMA and GPUDirect, bypassing the ML server's CPU and the control-plane bottlenecks of traditional collection. Implemented on Intel Tofino switches and NVIDIA A100 GPUs, DFA delivers over 31 million feature vectors per second and supports 524,000 flows with monitoring periods under 20 ms on a single port. This enables scalable, low-latency, ML-driven traffic analysis at terabit speeds for real-time network monitoring and security.

Abstract

Real-time traffic monitoring is critical for network operators to ensure performance, security, and visibility, especially as encryption becomes the norm. AI and ML have emerged as powerful tools to create deeper insights from network traffic, but collecting the fine-grained features needed at terabit speeds remains a major bottleneck. We introduce Direct Feature Access (DFA): a high-speed telemetry system that extracts flow features at line rate using P4-programmable data planes, and delivers them directly to GPUs via RDMA and GPUDirect, completely bypassing the ML server's CPU. DFA enables feature enrichment and immediate inference on GPUs, eliminating traditional control plane bottlenecks and dramatically reducing latency. We implement DFA on Intel Tofino switches and NVIDIA A100 GPUs, achieving extraction and delivery of over 31 million feature vectors per second, supporting 524,000 flows within sub-20 ms monitoring periods, on a single port. DFA unlocks scalable, real-time, ML-driven traffic analysis at terabit speeds, pushing the frontier of what is possible for next-generation network monitoring.
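The two headline numbers in the abstract can be checked against each other with a short back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each monitored flow contributes exactly one feature vector per monitoring period; the abstract does not state this, and the function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
VECTORS_PER_SECOND = 31_000_000  # "over 31 million feature vectors per second"
FLOWS = 524_000                  # "524,000 flows"
PERIOD_BUDGET_S = 0.020          # "sub-20 ms monitoring periods"


def sweep_time_s(flows: int, vectors_per_second: float, vectors_per_flow: int = 1) -> float:
    """Time needed to report every flow once.

    vectors_per_flow=1 is an illustrative assumption, not a figure from the paper.
    """
    return flows * vectors_per_flow / vectors_per_second


t = sweep_time_s(FLOWS, VECTORS_PER_SECOND)
print(f"one full sweep: {t * 1e3:.1f} ms")            # one full sweep: 16.9 ms
print(f"fits 20 ms budget: {t < PERIOD_BUDGET_S}")    # fits 20 ms budget: True
```

Under that assumption, one pass over all flows takes roughly 16.9 ms at the quoted rate, which is consistent with the claimed sub-20 ms monitoring period.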

Paper Structure

This paper contains 19 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: An overview of the DFA telemetry data collection system, comprising multiple instances of DFA Reporters, Translators, and Collectors.
  • Figure 2: Packet headers and sizes for DTA (used by the DFA Reporter) and RoCEv2 packets with DFA data (used by the DFA Translator).
  • Figure 3: Schematic path of data through the Collector. DTA (in red) copies data from the smartNIC to host memory and requires costly memory-copy operations to the GPU, involving the host CPU, for further processing. In contrast, the DFA Collector (in green) uses GPUDirect RDMA to bypass the host CPU.
  • Figure 4: Illustration of the memory structure maintaining 10 history entries per flow record. Each entry comprises telemetry features (packet count, inter-arrival time (IAT), packet sizes (PS)), the network five-tuple identifier, and a checksum for flow identification.
  • Figure 5: DFA Reporter design overview indicating interaction between data plane (implemented in P4) and control plane programs (implemented in Python).
  • ...and 4 more figures
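
The per-flow memory structure described in the Figure 4 caption can be pictured as a fixed-depth ring buffer of history entries. The following is a minimal host-side sketch in Python, not the paper's GPU memory layout: the field names, the microsecond unit for IAT, the 13-byte five-tuple packing, and the use of CRC32 as the checksum are all assumptions made for illustration. Only the depth of 10 entries and the list of fields come from the caption.

```python
import socket
import struct
import zlib
from collections import deque
from dataclasses import dataclass, field

HISTORY_ENTRIES = 10  # depth stated in the Figure 4 caption


@dataclass(frozen=True)
class FiveTuple:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int

    def pack(self) -> bytes:
        # 4 + 4 + 2 + 2 + 1 = 13 bytes, network byte order (assumed layout).
        return (socket.inet_aton(self.src_ip) + socket.inet_aton(self.dst_ip)
                + struct.pack("!HHB", self.src_port, self.dst_port, self.protocol))


def flow_checksum(five_tuple: FiveTuple) -> int:
    # CRC32 is a stand-in; the caption does not say which checksum is used.
    return zlib.crc32(five_tuple.pack())


@dataclass(frozen=True)
class HistoryEntry:
    packet_count: int
    iat_us: int        # inter-arrival time; the unit is an assumption
    packet_size: int   # bytes
    five_tuple: FiveTuple
    checksum: int


@dataclass
class FlowRecord:
    # deque(maxlen=...) drops the oldest entry once the record is full.
    history: deque = field(default_factory=lambda: deque(maxlen=HISTORY_ENTRIES))

    def append(self, packet_count: int, iat_us: int, packet_size: int,
               five_tuple: FiveTuple) -> None:
        self.history.append(HistoryEntry(packet_count, iat_us, packet_size,
                                         five_tuple, flow_checksum(five_tuple)))


# Usage: twelve reports for one flow leave only the ten most recent entries.
ft = FiveTuple("10.0.0.1", "10.0.0.2", 443, 51000, 6)
record = FlowRecord()
for n in range(1, 13):
    record.append(packet_count=n, iat_us=100, packet_size=1500, five_tuple=ft)
print(len(record.history), record.history[0].packet_count)  # 10 3
```

Storing the five-tuple and checksum in every entry mirrors the caption; a reader of the record can then verify that an entry belongs to the expected flow by recomputing the checksum from the stored five-tuple.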