MONET: Modeling and Optimization of neural NEtwork Training from Edge to Data Centers

Jérémy Morlier, Robin Geens, Stef Cuyckens, Arne Symons, Marian Verhelst, Vincent Gripon, Mathieu Léonardon

Abstract

While hardware-software co-design has significantly improved the efficiency of neural network inference, modeling the training phase remains a critical yet underexplored challenge. Training workloads impose distinct constraints, particularly regarding memory footprint and backpropagation complexity, which existing inference-focused tools fail to capture. This paper introduces MONET, a framework designed to model the training of neural networks on heterogeneous dataflow accelerators. MONET builds upon Stream, an experimentally verified framework that models the inference of neural networks on heterogeneous dataflow accelerators with layer fusion. Using MONET, we explore the design space of ResNet-18 and a small GPT-2, demonstrating the framework's capability to model training workflows and find better hardware architectures. We then examine problems that become more complex in neural network training due to the larger design space, such as determining the best layer-fusion configuration. Additionally, we use our framework, together with a genetic algorithm, to explore trade-offs in activation checkpointing. Our findings highlight the importance of a holistic approach to hardware-software co-design for scalable and efficient deep learning deployment.

Paper Structure (21 sections, 10 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Energy–latency trade-offs of a ResNet-18 across a diverse range of Edge TPU architectural configurations. Results are shown for inference (top) and training (bottom). Each point represents a distinct hardware configuration, with energy consumption (pJ) plotted against execution latency (cycles) and color-coded by total computational capacity. This highlights the energy/latency distribution difference between inference and training and the need to model hardware specifically for training purposes.
  • Figure 2: (a) All activations (edges) are saved during the forward pass and reused for the backward pass. (b) Some activations are discarded and recomputed during the backward pass, reducing the total memory cost.
  • Figure 3: Breakdown of peak memory consumption (GB) for a ResNet-50, measured on an RTX 3090 with 224 × 224 input images for two batch sizes (1 and 8).
  • Figure 4: EdgeTPU architecture: A set of $n_{PEs}$ Processing Elements (PEs) is arranged in a 2D array. Each PE can communicate with its neighbours, and a common bus links all of them to an off-chip memory. Each PE comprises a memory and a weight-stationary accelerator with $U$ SIMD Units and $L$ Compute Lanes. Adapted from Zhou2022CoDesignTPU.
  • Figure 5: Three strategies for spatially parallelizing deep learning workloads across devices Cerebras2023WSE: (a) Data Parallel: Distributes the batch dimension across multiple devices, with each device processing a distinct subset of samples. (b) Pipelined Parallel: Partitions the model into stages, with each device responsible for executing a specific stage of the computation. (c) Tensor Parallel: Splits individual layers or operations across devices, enabling parallel processing of a single sample.
  • ...and 7 more figures
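
The activation-checkpointing trade-off described in the Figure 2 caption can be illustrated with a toy cost model. This is a minimal sketch and not MONET's model: the function name `checkpointing_cost`, the per-layer sizes and the per-layer forward costs are all hypothetical, and the sketch assumes that each discarded activation is recomputed exactly once during the backward pass and that the transient memory used during recomputation is ignored.

```python
from typing import Sequence, Tuple


def checkpointing_cost(
    act_bytes: Sequence[int],
    fwd_cost: Sequence[int],
    keep: Sequence[bool],
) -> Tuple[int, int]:
    """Toy model of activation checkpointing on a chain of layers.

    act_bytes[i]: size of the activation produced by layer i (hypothetical units)
    fwd_cost[i]:  cost of recomputing layer i's forward pass (hypothetical units)
    keep[i]:      True if the activation is saved, False if it is discarded
                  and recomputed during the backward pass

    Returns (stored_bytes, recompute_cost).
    """
    if not (len(act_bytes) == len(fwd_cost) == len(keep)):
        raise ValueError("all inputs must have one entry per layer")
    # Memory held between forward and backward: only the saved activations.
    stored = sum(a for a, k in zip(act_bytes, keep) if k)
    # Extra compute: one additional forward pass per discarded activation.
    recompute = sum(c for c, k in zip(fwd_cost, keep) if not k)
    return stored, recompute


# Illustrative values only; they are not measurements from the paper.
act = [4, 2, 2, 1]
cost = [10, 5, 5, 2]
save_all = checkpointing_cost(act, cost, [True, True, True, True])
save_some = checkpointing_cost(act, cost, [True, False, True, False])
```

With these made-up numbers, saving every activation stores 9 units with no recomputation, while discarding the second and fourth activations stores 6 units at an extra cost of 7. A search procedure such as the genetic algorithm mentioned in the abstract would explore many such `keep` masks; how MONET scores them is not specified in this section.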