ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry

TL;DR

Dynamic control flow in deep learning complicates batching, hindering throughput. ACRoBat introduces a hybrid static+dynamic compile-time framework that generates end-to-end tensor kernels and leverages ahead-of-time code to lazily construct and schedule dataflow graphs, using fibers for tensor-dependent control flow. Its taint-based analysis identifies parameter reuse and data fusion opportunities, while ghost operations and program phases optimize scheduling. Empirical results show substantial performance gains over DyNet and competitive results with Cortex on Nvidia GPUs, underscoring the practical impact of hybrid analysis and kernel generation for dynamic DL workloads.

Abstract

Dynamic control flow is an important technique often used to design expressive and efficient deep learning computations for applications such as text parsing, machine translation, exiting early out of deep models and so on. The control flow divergence resulting from dynamic control flow makes batching, an important optimization enabling high throughput and hardware utilization, difficult to perform manually. In this paper, we present ACRoBat, a framework that enables efficient automatic batching for dynamic deep learning computations by performing hybrid static+dynamic compiler optimizations and end-to-end tensor code generation. ACRoBat performs up to 8.5X better than DyNet, a state-of-the-art framework for automatic batching, on an Nvidia GeForce GPU.
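To make the batching problem concrete, the sketch below shows one common way to batch across inputs whose control flow diverges: record operator calls lazily into a dataflow graph, tag each call with its depth as it is recorded, and then launch one batched call per depth level. This is a toy illustration only, not ACRoBat's implementation. The scalar weights `W` and `U`, the `Node` class, and all function names are hypothetical, and scalars stand in for tensors.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

W, U = 0.5, 0.25  # hypothetical scalar weights shared by every input instance


@dataclass
class Node:
    """One lazily recorded operator call in the dataflow graph."""
    op: str
    inputs: list                  # parent Nodes or constant floats
    depth: int                    # 1 + the largest depth among parent Nodes
    value: Optional[float] = None


def record(graph, op, *inputs):
    # Depth is computed inline while recording, so scheduling needs no
    # separate graph traversal later on.
    depth = 1 + max((i.depth for i in inputs if isinstance(i, Node)), default=-1)
    node = Node(op, list(inputs), depth)
    graph.append(node)
    return node


def run_rnn_lazily(graph, sequence):
    # The trip count depends on the input, so different inputs record
    # a different number of operator calls.
    h = 0.0
    for x in sequence:
        h = record(graph, "rnn_cell", x, h)
    return h


def rnn_cell_batched(xs, hs):
    # Stand-in for a single batched kernel launch over all grouped calls.
    return [math.tanh(W * x + U * h) for x, h in zip(xs, hs)]


def execute_batched(graph):
    # Group calls to the same operator at the same depth, then run each
    # group as one batched call, shallowest depth first.
    groups = defaultdict(list)
    for node in graph:
        groups[(node.depth, node.op)].append(node)
    launches = 0
    for key in sorted(groups):
        nodes = groups[key]
        xs = [n.inputs[0] for n in nodes]
        hs = [n.inputs[1].value if isinstance(n.inputs[1], Node) else n.inputs[1]
              for n in nodes]
        for n, v in zip(nodes, rnn_cell_batched(xs, hs)):
            n.value = v
        launches += 1
    return launches


def run_rnn_eagerly(sequence):
    # Unbatched reference: one operator call per time step per input.
    h = 0.0
    for x in sequence:
        h = math.tanh(W * x + U * h)
    return h
```

With three input sequences of lengths 3, 1, and 2, the graph holds six recorded cell calls, but grouping by depth executes them in three batched launches, one per time step. The cost of this fully dynamic scheme is that the graph construction and grouping happen at runtime for every batch, which is the kind of overhead a compile-time approach aims to reduce.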

Paper Structure (31 sections, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Overview of ACRoBat's workflow. Fig. \ref{fig:ap_dynet_overview} in the appendix shows a corresponding overview of DyNet, a prior fully dynamic approach. Note how ACRoBat performs significant novel analysis and code generation at compile-time to reduce runtime overheads.
  • Figure 2: A simple RNN model expressed in a functional language (here, Relay is used for illustration) as an input to ACRoBat.
  • Figure 3: AOT compiled output for the RNN model in Listing \ref{code:rnn_relay}, with inline depth computation code highlighted.
  • Figure 4: Concurrent call annotation.
  • Figure 5: Ghost operators can enable better batching.
  • ...and 6 more figures