Performance Models for a Two-tiered Storage System
Aparna Sasidharan, Xian-He, Jay Lofstead, Scott Klasky
TL;DR
This work tackles IO bottlenecks in HPC with a two-tier storage system: an NVMe-based distributed cache (tier-1) in front of shared HDD storage (tier-2), backed by an end-to-end performance model built from queuing networks and device-behavior analyses. It introduces an online-learning cache replacement policy that blends LRU, LFU, and Random experts to manage data movement between tiers, and evaluates it under Poisson and IRM (independent reference model) IO traffic on a many-core cluster. Key contributions include a formal queuing-network framework for end-to-end performance, device-specific performance models for NVMe and HDD, and a practical evaluation showing potential throughput gains and scalability trade-offs. The findings offer guidance for configuring tiered HPC storage and motivate future work on adaptive prefetching and data-migration strategies that rely on online training.
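To make the idea of blending LRU, LFU, and Random experts concrete, here is a minimal sketch of one way such a policy could work. It is not the paper's algorithm: the multiplicative weight update, the `learning_rate` parameter, the blame history, and the class name `ExpertCache` are all assumptions chosen for illustration.

```python
import random
from collections import Counter, OrderedDict


class ExpertCache:
    """Toy cache whose eviction victim is picked by a weighted mix of experts.

    Hypothetical illustration only; the update rule is an assumption,
    not the method described in the paper.
    """

    def __init__(self, capacity, learning_rate=0.5, seed=0):
        self.capacity = capacity
        self.learning_rate = learning_rate
        self.rng = random.Random(seed)
        self.recency = OrderedDict()  # resident keys, least recently used first
        self.freq = Counter()         # access counts of resident keys
        self.weights = {"lru": 1 / 3, "lfu": 1 / 3, "random": 1 / 3}
        self.blame = OrderedDict()    # evicted key -> expert that evicted it
        self.hits = 0
        self.misses = 0

    def _victim(self, expert):
        if expert == "lru":
            return next(iter(self.recency))
        if expert == "lfu":
            return min(self.recency, key=lambda k: self.freq[k])
        return self.rng.choice(list(self.recency))

    def _penalize(self, expert):
        # Shrink the weight of the expert whose eviction led to this miss,
        # then renormalize so the weights stay a probability distribution.
        self.weights[expert] *= 1.0 - self.learning_rate
        total = sum(self.weights.values())
        for name in self.weights:
            self.weights[name] /= total

    def access(self, key):
        """Return True on a hit, False on a miss (the key is then admitted)."""
        if key in self.recency:
            self.hits += 1
            self.recency.move_to_end(key)
            self.freq[key] += 1
            return True

        self.misses += 1
        expert = self.blame.pop(key, None)
        if expert is not None:
            self._penalize(expert)

        if len(self.recency) >= self.capacity:
            names = list(self.weights)
            probs = [self.weights[n] for n in names]
            expert = self.rng.choices(names, weights=probs)[0]
            victim = self._victim(expert)
            del self.recency[victim]
            del self.freq[victim]
            self.blame[victim] = expert
            if len(self.blame) > self.capacity:
                self.blame.popitem(last=False)  # keep the history bounded

        self.recency[key] = None
        self.freq[key] = 1
        return False
```

Each miss on a recently evicted key lowers the weight of the expert that chose that eviction, so the mix drifts toward whichever expert suits the current access pattern. In a two-tier setting, a hit would be served from tier-1 and a miss would trigger a fetch from tier-2.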
Abstract
This work describes the design, implementation, and performance analysis of distributed two-tiered storage software. The first tier functions as a distributed software cache implemented using solid-state devices (NVMes), and the second tier consists of multiple hard disks (HDDs). We describe an online learning algorithm that manages data movement between the tiers. The software is hybrid, i.e., both distributed and multi-threaded. The end-to-end performance model of the two-tier system was developed using queuing networks and behavioral models of storage devices. We identified significant parameters that affect the performance of storage devices and created behavioral models for each device. The performance of the software was evaluated on a many-core cluster using non-trivial read/write workloads. The paper provides examples to illustrate the use of these models.
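As a rough illustration of how a queuing-network view yields an end-to-end estimate, the sketch below treats each tier as a single M/M/1 queue under Poisson arrivals and routes cache misses to the second tier. This is a textbook simplification, not the paper's model: the function names, the single-server assumption per tier, and the example rates are all hypothetical.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean time in an M/M/1 queue (waiting plus service): 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def two_tier_response_time(arrival_rate, hit_ratio, nvme_rate, hdd_rate):
    """Mean end-to-end response time for a simplified two-tier system.

    Assumptions (illustrative only):
      - every request visits tier-1 (NVMe cache), modeled as M/M/1;
      - a fraction (1 - hit_ratio) of requests also visits tier-2 (HDD),
        modeled as M/M/1 with arrival rate (1 - hit_ratio) * arrival_rate.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be in [0, 1]")
    tier1 = mm1_response_time(arrival_rate, nvme_rate)
    miss_fraction = 1.0 - hit_ratio
    tier2 = mm1_response_time(miss_fraction * arrival_rate, hdd_rate)
    return tier1 + miss_fraction * tier2


# Hypothetical rates in requests per second, chosen only for illustration.
estimate = two_tier_response_time(
    arrival_rate=100.0, hit_ratio=0.9, nvme_rate=1000.0, hdd_rate=50.0
)
```

With these made-up rates, the HDD term dominates the estimate even at a 90% hit ratio, which is the kind of sensitivity such a model is meant to expose. A behavioral device model would replace the constant service rates with functions of parameters such as request size or queue depth.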
