
SALSA: Simulated Annealing based Loop-Ordering Scheduler for DNN Accelerators

Victor J. B. Jung, Arne Symons, Linyan Mei, Marian Verhelst, Luca Benini

TL;DR

SALSA addresses the challenge of finding high-quality loop-orderings for DNN accelerators across diverse hardware by introducing a dual-engine scheduler that combines exhaustive search with simulated annealing. It decouples loop ordering from memory allocation to support both even and uneven mappings and uses a cost model to optimize energy and latency. On five DNN benchmarks, SALSA achieves 7.6% energy reductions against Timeloop and 11.9% against LOMA while delivering 24x and 1.7x faster search, respectively, and reaches near-optimal schedules with high reliability. The work demonstrates a practical, open-source scheduler that accelerates architecture exploration and improves energy efficiency in DNN accelerators.

Abstract

To meet the growing need for computational power for DNNs, multiple specialized hardware architectures have been proposed. Each DNN layer should be mapped onto the hardware with the most efficient schedule; however, SotA schedulers struggle to consistently provide optimum schedules in a reasonable time across all DNN-HW combinations. This paper proposes SALSA, a fast dual-engine scheduler to generate optimal execution schedules for both even and uneven mapping. We introduce a new strategy, combining exhaustive search with simulated annealing to address the dynamic nature of the loop ordering design space size across layers. SALSA is extensively benchmarked against two SotA schedulers, LOMA and Timeloop, on 5 different DNNs. On average, SALSA finds schedules with 11.9% and 7.6% lower energy while speeding up the search by 1.7x and 24x compared to LOMA and Timeloop, respectively.
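
The dual-engine idea described above can be illustrated with a minimal Python sketch. It is not SALSA's implementation. The toy cost function, the `WEIGHTS` table, the `exhaustive_limit` threshold, the swap-two-loops move, and the cooling parameters are all assumptions made for illustration; a real scheduler would evaluate each ordering with a hardware cost model for energy and latency. The sketch enumerates every loop ordering when the design space is small and switches to simulated annealing when it is large.

```python
import itertools
import math
import random
from collections import Counter

# Hypothetical per-dimension weights; a stand-in for a real hardware cost model.
WEIGHTS = {"K": 5, "C": 2, "OX": 1}

def toy_cost(order):
    """Toy cost: loops placed at later positions are penalized by factor * weight."""
    return sum(pos * factor * WEIGHTS[dim] for pos, (dim, factor) in enumerate(order))

def count_orderings(loops):
    """Number of distinct orderings of a multiset: n! / prod(multiplicity!)."""
    total = math.factorial(len(loops))
    for multiplicity in Counter(loops).values():
        total //= math.factorial(multiplicity)
    return total

def exhaustive(loops, cost):
    """Evaluate every distinct ordering and return the cheapest one."""
    return min(set(itertools.permutations(loops)), key=cost)

def anneal(loops, cost, t_start=10.0, t_end=0.01, alpha=0.95, steps_per_temp=50, seed=0):
    """Simulated annealing over orderings; a move swaps two loop positions (assumed)."""
    if len(loops) < 2:
        return tuple(loops)
    rng = random.Random(seed)
    current = list(loops)
    cur_cost = cost(current)
    best, best_cost = list(current), cur_cost
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            i, j = rng.sample(range(len(current)), 2)
            current[i], current[j] = current[j], current[i]
            new_cost = cost(current)
            delta = new_cost - cur_cost
            # Always accept improvements; accept worse moves with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cur_cost = new_cost
                if cur_cost < best_cost:
                    best, best_cost = list(current), cur_cost
            else:
                current[i], current[j] = current[j], current[i]  # undo rejected swap
        t *= alpha  # geometric cooling
    return tuple(best)

def schedule(loops, cost, exhaustive_limit=1000):
    """Pick the engine from the design space size (threshold is hypothetical)."""
    if count_orderings(loops) <= exhaustive_limit:
        return exhaustive(loops, cost)
    return anneal(loops, cost)

small = [("K", 2), ("K", 2), ("C", 3), ("OX", 7)]
large = [("K", 2)] * 3 + [("C", 2)] * 2 + [("C", 3), ("OX", 7), ("OX", 2)]
for name, loops in (("small", small), ("large", large)):
    result = schedule(loops, toy_cost)
    print(name, count_orderings(loops), toy_cost(result), result)
```

Choosing the engine from the number of distinct orderings reflects the abstract's point that the design space size varies across layers: exhaustive search is exact but only affordable for small spaces, while annealing trades the optimality guarantee for a bounded search time.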

Paper Structure

This paper contains 19 sections, 3 equations, 6 figures.

Figures (6)

  • Figure 1: Overview of the SALSA implementation.
  • Figure 2: Detailed example of SALSA's Simulated Annealing path. The workload used in this figure is fictional for the purpose of demonstration, and the Memory Hierarchy is composed of three levels: DRAM, Shared Buffer, and Registers.
  • Figure 3: Graph illustrating the required search time for different search strategies for varying numbers of LPFs for AlexNet Layer 2. Note the logarithmic y-axis.
  • Figure 4: Mapping energy distribution during a search for layer 2 of AlexNet using Timeloop and SALSA. Best viewed in color.
  • Figure 5: Comparison of SALSA, LOMA 7, and Timeloop for 5 DNNs. In this figure, LOMA is configured with an LPF limitation factor of 7. The left part displays the Energy and Search Time for every unique layer of ResNet50, while the right part shows the average Energy Reduction and Speed-up of each DNN. Energy Reduction and Speed-up in the right plots are normalized with Timeloop's Energy and Time, respectively.
  • ...and 1 more figures