Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures

Marco Siracusa, Olivia Hsu, Victor Soria-Pardos, Joshua Randall, Arnaud Grasset, Eric Biscondi, Doug Joseph, Randy Allen, Fredrik Kjolstad, Miquel Moretó Planas, Adrià Armejach

TL;DR

This work identifies embedding lookups as a fundamental bottleneck in embedding-heavy ML models due to irregular memory accesses. It proposes Decoupled Access-Execute (DAE) hardware and the Ember compiler to automatically lower PyTorch/TensorFlow embeddings into optimized DAE code, leveraging two IRs (DLC and SLC) to achieve both local and global optimizations. On end-to-end models, DAE processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs, and Ember's generated code matches the performance of hand-written DAE implementations. The work demonstrates the potential for scalable, embedding-intensive inference in datacenters and provides a practical path to deploying DAE architectures with no programming burden on users.

Abstract

Irregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs on end-to-end models. Then, we propose the Ember compiler for automatically generating optimized DAE code from PyTorch and TensorFlow. Unlike other DAE compilers, Ember features multiple intermediate representations, each designed for a different optimization level. In this way, Ember can implement all the optimizations needed to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.
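
To make the irregular lookups discussed in the abstract concrete, the following is a minimal sketch of a sum-pooled embedding lookup in plain Python. It is an illustration only, not Ember's generated code or any Ember API: the function name `embedding_bag_sum`, the toy table, and the index values are all hypothetical, and the `indices`/`offsets` layout is only loosely modeled on the convention used by PyTorch's `nn.EmbeddingBag`.

```python
from typing import List, Sequence


def embedding_bag_sum(
    table: Sequence[Sequence[float]],
    indices: Sequence[int],
    offsets: Sequence[int],
) -> List[List[float]]:
    """Sum-pool embedding rows per bag (hypothetical helper, for illustration).

    Bag b covers indices[offsets[b] : offsets[b + 1]]; the last bag runs to
    the end of `indices`.
    """
    dim = len(table[0])
    bounds = list(offsets) + [len(indices)]
    pooled = []
    for b in range(len(offsets)):
        acc = [0.0] * dim
        for idx in indices[bounds[b]:bounds[b + 1]]:
            # Data-dependent row address: consecutive iterations can touch
            # rows that are far apart in memory (the "access" part).
            row = table[idx]
            # Dense reduction over the fetched row (the "execute" part).
            for d in range(dim):
                acc[d] += row[d]
        pooled.append(acc)
    return pooled


# Toy table (assumed values): row r holds [10r, 10r + 1, 10r + 2, 10r + 3].
table = [[float(r * 10 + c) for c in range(4)] for r in range(8)]
indices = [6, 1, 3, 0, 7]   # scattered row ids
offsets = [0, 3]            # bag 0 = rows 6, 1, 3; bag 1 = rows 0, 7
pooled = embedding_bag_sum(table, indices, offsets)
```

The inner loop separates naturally into a part that computes addresses and fetches rows and a part that reduces them, which is the split a decoupled access-execute design relies on when it hands the fetches to a dedicated access unit.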

Paper Structure

This paper contains 32 sections, 2 equations, 27 figures, 4 tables.

Figures (27)

  • Figure 1: Deep-learning recommendation models (DLRM) (\ref{sec:dlrms}), large language models (LLM) with sparse attention (\ref{sec:llms}), knowledge graphs (KG) (\ref{sec:graph-learning}), and graph neural networks (GNN) (\ref{sec:graph-learning}) heavily rely on embedding operations that do not perform efficiently even on modern Nvidia H100 GPUs \cite{nvidiah100}. All experiments use highly optimized models from the literature (\ref{sec:characterization-implications}).
  • Figure 2: Feature embedding requires scattered memory lookups to fetch embedding vectors from embedding tables.
  • Figure 3: Architectural implications of embedding lookups on a traditional CPU core (details in \ref{fig:dae-specs}). A large fraction of embedding lookups in GNN models (\ref{sec:graph-learning}) take orders of magnitude longer than L1D accesses (a). For instance, more than 74% of product's lookups take 10$\times$ longer than an L1D access, and 40% take more than 100$\times$ longer. However, traditional CPU cores have limited memory-level parallelism and can track only a few in-flight lookups (b), stalling the CPU pipeline. This results in low memory request throughput (c) and low per-core HBM utilization (d).
  • Figure 4: Implications of scaling up the memory-level parallelism of a traditional CPU core (details in \ref{fig:dae-specs}) for GNN embedding operations (\ref{sec:graph-learning}). Doubling the reorder buffer, load-store queue, and L1D miss-status handling registers (2R.2L.2M) provides limited performance improvements and worse perf/watt than off-the-shelf cores (1R.1L.1M).
  • Figure 5: A DAE processor. Each traditional core offloads embedding lookups to an access unit such as the TMU \cite{tmu2023micro}.
  • ...and 22 more figures