Just-in-Time Packet State Prefetching

Hamid Ghasemirahni; Alireza Farshin; Dejan Kostic; Marco Chiesa

TL;DR

The paper tackles the per-flow state bottleneck in high-speed, CPU-based packet processing by proposing Nostradamus, a system that provides hints about upcoming packets so that the state needed to process them can be prefetched into caches just in time. It demonstrates that careful timing and placement of prefetches can recover substantial throughput, with measurements showing improvements of up to 50% for a stateful L4 load balancer along with fewer cache misses. The authors discuss the design space for providing hints (end hosts vs. network devices) and for prefetching strategies (in-application vs. NIC-assisted), outline challenges, and chart future directions across applications, hardware accelerators, and data structures. Overall, the work highlights a promising approach to bridging networking requirements and cache hierarchies, potentially enabling higher throughput at multi-hundred-Gbps rates.

Abstract

Could information about future incoming packets be used to build more efficient CPU-based packet processors? Can such information be obtained accurately? This paper studies novel packet processing architectures that receive external hints about which packets are soon to arrive, thus enabling prefetching into fast cache memories of the state needed to process them, just-in-time for the packets' arrival. We explore possible approaches to (i) obtain such hints either from network devices or the end hosts in the communication and (ii) use these hints to better utilize cache memories. We show that such information (if accurate) can improve packet processing throughput by at least 50%.
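
As a rough illustration of the idea in the abstract, the sketch below prefetches per-flow state a fixed number of packets ahead of the packet currently being processed. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `flow_state`, `flow_table`, `lookup`, and `process_batch` are hypothetical, the external hints are modeled simply as an array of flow hashes for upcoming packets, and the flow table is a direct-indexed array that ignores collisions. The only real API used is the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-flow state of a stateful L4 load balancer. */
struct flow_state {
    uint32_t backend;  /* backend server chosen for this flow */
    uint64_t packets;  /* number of packets seen for this flow */
};

#define TABLE_SIZE 1024u  /* power of two; illustrative size only */

/* Hypothetical direct-indexed flow table (assumption: no collision handling). */
static struct flow_state flow_table[TABLE_SIZE];

/* Map a flow hash to its state entry. */
static inline struct flow_state *lookup(uint32_t flow_hash)
{
    return &flow_table[flow_hash & (TABLE_SIZE - 1)];
}

/*
 * Process n packets, identified here only by their flow hashes.
 * `distance` is the prefetching distance: while handling packet i,
 * the state for packet i + distance is requested from memory so that
 * it is (ideally) cached by the time that packet is processed.
 * Returns the number of packets processed.
 */
static uint64_t process_batch(const uint32_t *hashes, size_t n, size_t distance)
{
    uint64_t processed = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + distance < n) {
            /* rw = 1: the state will be written; locality = 3: keep it
             * in all cache levels. Locality 0 would request a
             * non-temporal prefetch instead. */
            __builtin_prefetch(lookup(hashes[i + distance]), 1, 3);
        }

        struct flow_state *st = lookup(hashes[i]);
        st->packets++;  /* stand-in for the real per-packet work */
        processed++;
    }
    return processed;
}
```

The prefetch is only a hint to the CPU and does not change program results, so the value of `distance` affects performance only. Too small a distance leaves the state still in flight when it is needed, while too large a distance risks the state being evicted before use, which is why the distance needs tuning for a given workload.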

Paper Structure (14 sections, 5 figures, 1 table)

Introduction
Background and Motivation
Load Balancers
Impact of Statefulness on Performance
Prefetch the State in Advance
Challenges and Solutions
Potential Benefits
Building a Just-in-time Prefetcher
Future Use Cases and Directions
Applications
Programmable Hardware & Accelerators
Optimizing Data Structures and Code
Further Cache Optimizations
Conclusion

Figures (5)

Figure 1: Increasing the number of flows causes an exponential decay in throughput of an L4 load balancer.
Figure 2: The average number of per-packet LLC misses increases with larger numbers of flows, which is inversely proportional to the throughput. The exponential increase in the number of per-packet L2 misses corresponds to the initial throughput drop.
Figure 3: Fine-tuning the spatial prefetching distance is essential to maximize the throughput improvements.
Figure 4: Using prefetchnta reduces the throughput improvements for large spatial prefetching distances, due to its lower temporal & spatial locality.
Figure 5: Performing just-in-time prefetching improves the throughput by up to 50% (i.e., it recovers the throughput drop due to statefulness).
