BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems

Konstantin Burlachenko, Peter Richtárik

TL;DR

BurTorch presents a CPU-centered, compile-based framework for training from first principles by coupling autodiff, math optimization, and systems design. It introduces a latency-focused gradient oracle ∇f(x) = (1/b) ∑_{i∈S} ∇f_i(x), the average of per-sample gradients over a mini-batch S of size b, and demonstrates dramatic improvements in latency and memory for small compute graphs, including GPT-3–like models, compared with mainstream frameworks. The key contributions are a compact C++20 implementation with minimal abstractions, serialized gradient computation to minimize activation memory, and a suite of experiments showing up to ×2000 speedups and ×3500 memory reductions on small graphs, with up to ×20 speedups and ×80 memory reductions over PyTorch on a miniaturized GPT-3 model. The work argues that substantial practical gains arise from system-level optimizations (compile-time code generation, memory contiguity, and careful backpropagation design), making BurTorch appealing for on-device training, federated learning, and latency-sensitive applications, albeit with limitations in large-batch scalability and GPU use.
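
To make the gradient oracle concrete, the following is a minimal sketch of serialized mini-batch gradient computation. It is not BurTorch's API: the `Sample` type, the function names, and the least-squares loss f_i(x) = ½(a_iᵀx − y_i)² are all assumptions chosen for illustration. The point it shows is that per-sample gradients can be computed one at a time and accumulated into a single buffer, so working memory does not grow with the batch size b.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sample type (not from BurTorch): one feature row a_i and its target y_i.
struct Sample {
    std::vector<double> a;
    double y;
};

// Adds scale * grad f_i(x) to `acc`, where f_i(x) = 0.5 * (a_i^T x - y_i)^2
// and therefore grad f_i(x) = (a_i^T x - y_i) * a_i.
void accumulate_sample_gradient(const std::vector<double>& x, const Sample& s,
                                double scale, std::vector<double>& acc) {
    double residual = -s.y;
    for (std::size_t j = 0; j < x.size(); ++j) residual += s.a[j] * x[j];
    for (std::size_t j = 0; j < x.size(); ++j) acc[j] += scale * residual * s.a[j];
}

// Gradient oracle: (1/b) * sum over i in S of grad f_i(x), with S given as indices
// into `data`. Samples are processed serially; only one gradient buffer is allocated.
std::vector<double> minibatch_gradient(const std::vector<double>& x,
                                       const std::vector<Sample>& data,
                                       const std::vector<std::size_t>& batch) {
    assert(!batch.empty());
    std::vector<double> grad(x.size(), 0.0);
    const double scale = 1.0 / static_cast<double>(batch.size());
    for (std::size_t i : batch) accumulate_sample_gradient(x, data[i], scale, grad);
    return grad;
}
```

In a real network, the per-sample step would be a full forward and backward pass. The accumulation pattern is the same, and it is what keeps activation memory at the footprint of a single sample.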

Abstract

In this work, we introduce BurTorch, a compact high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations through an exceptionally efficient CPU-based backpropagation (Rumelhart et al., 1986; Linnainmaa, 1970) implementation. Although modern DL frameworks rely on compiler-like optimizations internally, BurTorch takes a different path. It adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research. By eliminating the overhead of large frameworks and making efficient implementation choices, BurTorch achieves orders-of-magnitude improvements in performance and memory efficiency when computing $\nabla f(x)$ on a CPU. BurTorch features a compact codebase designed to achieve two key goals simultaneously. First, it provides a user experience similar to script-based programming environments. Second, it dramatically minimizes runtime overheads. In large DL frameworks, the primary source of memory overhead for relatively small computation graphs $f(x)$ is their feature-heavy implementation. We benchmarked BurTorch against widely used DL frameworks in their execution modes: JAX (Bradbury et al., 2018), PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016); and several standalone libraries: Autograd (Maclaurin et al., 2015), Micrograd (Karpathy, 2020), Apple MLX (Hannun et al., 2023). For small compute graphs, BurTorch outperforms best-practice solutions by up to $\times 2000$ in runtime and reduces memory consumption by up to $\times 3500$. For a miniaturized GPT-3 model (Brown et al., 2020), BurTorch achieves up to a $\times 20$ speedup and reduces memory by up to $\times 80$ compared to PyTorch.

Paper Structure

This paper contains 116 sections, 8 equations, 8 figures, 19 tables.

Figures (8)

  • Figure 1: Tiny compute graph with $10$ nodes to evaluate $g=f/2,f=e^2,e=c-d, d=ab + b^3, c=a+b,a=-41,b=2$. Nodes contain: description, operator, $\dfrac{\partial g}{\partial [\mathrm{\textit{node}}]}$, value, raw index. The numerical results across frameworks match exactly.
  • Figure 2: Small compute graph with a total of $V=32$ nodes and $E=44$ edges to evaluate the function from Karpathy (2020).
  • Figure 3: Visualization of Table \ref{tab:execution_times_speedup}. Backpropagation over $100$K iterations with the tiny dynamic compute graph from Figure 1. Computation in FP64, one CPU core, Windows OS. The numerical results across frameworks match exactly.
  • Figure 4: Listings for the small compute graph shown in Figure 2, adapted from Karpathy (2020).
  • Figure 5: Visualization of Table \ref{tab:execution_times_speedup_linux}. Backpropagation over $100$K iterations with the tiny dynamic compute graph from Figure 1. Computation in FP64, one CPU core at $3.2$ GHz (x86-64), Linux Ubuntu 20.04.
  • ...and 3 more figures
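
As an illustration of the tiny compute graph described in the Figure 1 caption, the sketch below evaluates $g=f/2$, $f=e^2$, $e=c-d$, $d=ab+b^3$, $c=a+b$ with a hand-written forward pass followed by a reverse pass. It is plain C++, not BurTorch code. The struct and function names are hypothetical, and splitting $d$ into the intermediate nodes $ab$ and $b^3$ is an assumption about how the graph is decomposed.

```cpp
// Hypothetical result type: the output value and its partial derivatives
// with respect to the two leaf nodes a and b.
struct TinyGraphResult {
    double g;
    double dg_da;
    double dg_db;
};

TinyGraphResult tiny_graph(double a, double b) {
    // Forward pass: evaluate every node in topological order.
    const double c  = a + b;
    const double ab = a * b;
    const double b3 = b * b * b;
    const double d  = ab + b3;
    const double e  = c - d;
    const double f  = e * e;
    const double g  = f / 2.0;

    // Reverse pass: propagate dg/d(node) from the output back to the leaves.
    const double dg_df  = 0.5;               // g = f / 2
    const double dg_de  = dg_df * 2.0 * e;   // f = e^2
    const double dg_dc  = dg_de;             // e = c - d
    const double dg_dd  = -dg_de;
    const double dg_dab = dg_dd;             // d = ab + b^3
    const double dg_db3 = dg_dd;

    // a feeds c and ab; b feeds c, ab, and b^3, so their contributions are summed.
    const double dg_da = dg_dc + dg_dab * b;
    const double dg_db = dg_dc + dg_dab * a + dg_db3 * 3.0 * b * b;

    return {g, dg_da, dg_db};
}
```

With the inputs from the caption, $a=-41$ and $b=2$, the intermediate values are $c=-39$, $d=-74$, $e=35$, $f=1225$, giving $g=612.5$, $\partial g/\partial a=-35$, and $\partial g/\partial b=1050$.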