FADiff: Fusion-Aware Differentiable Optimization for DNN Scheduling on Tensor Accelerators

Shuao Jia, Zichao Ling, Chen Bai, Kang Zhao, Jianwang Zhai

TL;DR

The paper tackles the challenge of deploying DNNs on tensor accelerators by jointly optimizing intra-layer mapping and inter-layer fusion, two design dimensions that are traditionally explored separately. It introduces FADiff, a gradient-based framework that co-optimizes mapping and fusion using differentiable energy and latency models together with constraint penalties, and that relaxes the discrete choices into continuous ones via Gumbel-Softmax and a continuous fusion variable. Empirical results on Gemmini configurations show significant reductions in energy-delay product (EDP) over state-of-the-art baselines and layer-wise approaches, highlighting the benefit of joint, differentiable optimization and its applicability to both CNN and Transformer workloads. The work demonstrates scalable, interpretable, and hardware-valid deployment strategies for modern DNN workloads on tensor accelerators.
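To make the Gumbel-Softmax relaxation mentioned above concrete, here is a minimal pure-Python sketch of the general technique. It is not FADiff's implementation: the candidate tile sizes, the logits, the temperatures, and the helper name `gumbel_softmax` are all hypothetical and chosen only for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Return one relaxed one-hot sample over len(logits) discrete choices."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate tile sizes and learnable logits (illustrative values).
tile_sizes = [1, 2, 4, 8]
logits = [0.1, 0.5, 2.0, 0.3]

rng = random.Random(0)
soft = gumbel_softmax(logits, tau=1.0, rng=rng)    # smooth mixture of choices
sharp = gumbel_softmax(logits, tau=0.05, rng=rng)  # typically close to one-hot

# A relaxed "tile size": the probability-weighted average of the candidates.
relaxed_tile = sum(p * t for p, t in zip(soft, tile_sizes))
print(soft, sharp, relaxed_tile)
```

Because the output is a smooth function of the logits, an automatic-differentiation framework could propagate gradients of a cost such as EDP back to them. Lowering the temperature `tau` pushes each sample toward a single discrete choice; how FADiff schedules the temperature is not stated here, so the two values above are assumptions.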

Abstract

Efficient deployment of Deep Neural Networks (DNNs), such as Large Language Models (LLMs), on tensor accelerators is essential for maximizing computational efficiency in modern AI systems. However, achieving this is challenging due to the enormous and complex design space created by the interaction of intra-layer mapping and inter-layer fusion. In this work, we present FADiff, a gradient-based optimization framework capable of automatically identifying high-quality intra-layer mapping and inter-layer fusion strategies to accelerate inference for DNN workloads. We first construct a unified and differentiable analytical cost model, which accurately predicts the energy and latency of both single-layer mappings and various layer fusion strategies. Then, by encoding discrete constraints into the loss function, we employ a gradient-based approach to efficiently explore the vast design space, determining the optimal joint strategy for mapping and fusion. Experimental results demonstrate the superiority of FADiff, achieving better optimization in terms of energy and latency compared to existing methods.

Paper Structure

This paper contains 25 sections, 25 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: The joint design space of inter-layer Fusion and intra-layer Mapping. While traditional heuristic and learning-based methods (A) explore a limited sub-region and layer-wise gradient approaches (B) are confined to a single axis, our work (C) performs a comprehensive, differentiable search across the entire space to approach the global optimum (red $\star$).
  • Figure 2: Overview of FADiff. Left (regions a--c): hardware architecture, DNN workloads, and design variables. Right (region d): differentiable optimization engine.
  • Figure 3: Comparison of Z-score normalized trends. Left: two-layer fusion. Right: three-layer fusion.
  • Figure 4: EDP vs. optimization time for GA, BO, and our gradient-based method; with the same time budgets, our method converges faster to lower EDP.