Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures
Marco Siracusa, Olivia Hsu, Victor Soria-Pardos, Joshua Randall, Arnaud Grasset, Eric Biscondi, Doug Joseph, Randy Allen, Fredrik Kjolstad, Miquel Moretó Planas, Adrià Armejach
TL;DR
This work identifies embedding lookups as a fundamental bottleneck in embedding-heavy ML models because of their irregular memory accesses. It offloads these lookups to Decoupled Access-Execute (DAE) hardware and proposes the Ember compiler, which automatically lowers PyTorch/TensorFlow embedding operations into optimized DAE code, using two IRs (DLC and SLC) to apply both local and global optimizations. On end-to-end models, DAE processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs, and Ember-generated code matches the performance of hand-written DAE implementations. The work demonstrates the potential for scalable, embedding-intensive inference in datacenters and provides a practical path to deploying DAE architectures without adding programming burden for users.
Abstract
Irregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs on end-to-end models. We then propose the Ember compiler, which automatically generates optimized DAE code from PyTorch and TensorFlow. Unlike other DAE compilers, Ember features multiple intermediate representations, each designed for a different optimization level. This allows Ember to implement all the optimizations needed to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.
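To make the access/execute split concrete, the following is a minimal illustrative sketch in plain Python, not Ember's generated code or any of its intermediate representations. It models a sum-reduced embedding lookup (in the spirit of an embedding-bag operation) as two decoupled stages connected by a queue. The function names, the queue-based interface, and the CSR-style `offsets` array with a trailing end offset are all assumptions made for illustration.

```python
from collections import deque


def access_unit(indices, offsets, table):
    """Access side (hypothetical): follow the irregular indices and
    stream the fetched embedding rows, tagged with their bag id."""
    queue = deque()
    for bag in range(len(offsets) - 1):
        for pos in range(offsets[bag], offsets[bag + 1]):
            # Data-dependent load: the row address depends on indices[pos].
            queue.append((bag, table[indices[pos]]))
    return queue


def execute_unit(queue, num_bags, dim):
    """Execute side (hypothetical): consume streamed rows and
    accumulate them per bag, with no address computation of its own."""
    out = [[0.0] * dim for _ in range(num_bags)]
    while queue:
        bag, row = queue.popleft()
        for d in range(dim):
            out[bag][d] += row[d]
    return out


def embedding_bag_sum(indices, offsets, table):
    """Sum-reduce the rows selected by each bag of indices."""
    dim = len(table[0])
    num_bags = len(offsets) - 1
    return execute_unit(access_unit(indices, offsets, table), num_bags, dim)


# Toy example: 4 embedding rows of dimension 2, two bags of two lookups each.
table = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0], [1000.0, 2000.0]]
indices = [0, 2, 3, 1]   # irregular row ids
offsets = [0, 2, 4]      # bag b covers indices[offsets[b]:offsets[b + 1]]
result = embedding_bag_sum(indices, offsets, table)
print(result)
```

In this sketch the two stages run one after the other for simplicity; on a DAE processor the access and execute sides would run concurrently, so the irregular loads could overlap with the reduction.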
