Performance Characterization and Optimizations of Traditional ML Applications
Harsh Kumar, R. Govindarajan
TL;DR
Traditional ML methods remain memory-bound on large datasets, and tree-based workloads additionally suffer substantial branch-misprediction and memory stalls. The paper performs a microarchitectural characterization of 13 ML workloads across matrix-, neighbor-, and tree-based families on single- and small multi-core servers, using VTune and perf. It shows that DRAM latency, LLC misses, and bad speculation dominate execution time, and that two targeted optimizations implemented in scikit-learn, software prefetching and data-layout/computation reordering, yield speedups of roughly 5% to 28% depending on the workload. These findings offer practical, portable guidance for accelerating traditional ML pipelines on commodity hardware in real-world data science tasks.
Abstract
Even in the era of deep-learning-based methods, traditional machine learning methods applied to large datasets continue to attract significant attention. However, we find an apparent lack of detailed performance characterization of these methods in the context of large training datasets. In this work, we study the system-level behavior of a number of traditional ML methods, as implemented in popular free software libraries/modules, to identify the critical performance bottlenecks these applications experience. The characterization study reveals several insights into their performance. We then evaluate the benefits of applying some well-known optimizations at the levels of the caches and the main memory. More specifically, we test the usefulness of (i) software prefetching to improve cache performance and (ii) data layout and computation reordering optimizations to improve locality in DRAM accesses. These optimizations are implemented as modifications to the well-known scikit-learn library, and hence can be easily leveraged by application programmers. We evaluate the impact of the proposed optimizations using a combination of simulation and execution on a real system. The software prefetching optimization yields performance benefits ranging from 5.2% to 27.1% across different ML applications, while the data layout and computation reordering approaches yield performance improvements of 6.16% to 28.0%.
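To make the idea of computation reordering concrete, the following is a minimal NumPy sketch, not the paper's scikit-learn modification. It assumes a gather-style workload (for example, a neighbor lookup) that visits rows of a feature matrix in an arbitrary order. The function names `row_sq_norms` and `row_sq_norms_reordered` are hypothetical and used only for illustration. The reordered variant sorts the requested row indices so that the gather walks memory in ascending address order, then scatters the results back to the original request order.

```python
import numpy as np


def row_sq_norms(X, idx):
    """Baseline: gather rows in the requested (possibly random) order."""
    rows = X[idx]
    # Squared Euclidean norm of each gathered row.
    return np.einsum("ij,ij->i", rows, rows)


def row_sq_norms_reordered(X, idx):
    """Same result, but rows are gathered in ascending index order.

    Illustrative assumption: X is C-contiguous, so ascending row indices
    correspond to ascending memory addresses.
    """
    order = np.argsort(idx, kind="stable")  # permutation that sorts idx
    rows = X[idx[order]]                    # gather in address order
    vals = np.einsum("ij,ij->i", rows, rows)
    out = np.empty_like(vals)
    out[order] = vals                       # scatter back to request order
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 16))
    idx = rng.integers(0, X.shape[0], size=5000)
    same = np.allclose(row_sq_norms(X, idx), row_sq_norms_reordered(X, idx))
    print("results match:", same)
```

Both functions return identical values; only the order in which memory is touched differs. Whether the reordered version is actually faster depends on the dataset size relative to the cache hierarchy and on the cost of the extra sort, so any gain should be measured rather than assumed.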
