DOMAC: Differentiable Optimization for High-Speed Multipliers and Multiply-Accumulators

Chenhao Xue, Yi Ren, Jinwei Zhou, Kezhi Li, Chen Zhang, Yibo Lin, Lining Zhang, Qiang Xu, Guangyu Sun

TL;DR

DOMAC addresses the design-space explosion in high-speed multipliers by formulating compressor-tree (CT) design as differentiable optimization, enabling gradient-based search across interconnections and implementations. By mapping CT design to a DNN-training paradigm, it introduces differentiable timing and area objectives and a legalization step to recover discrete solutions. Key contributions include a differentiable area objective, a differentiable timing pipeline (pin load, cell delay, net delay propagation, and timing slack estimation), and regularization via softmax substitutions and log-sum-exp (LSE) smoothing. Empirical results show DOMAC achieving up to $6.5\%$ delay reduction and $25\%$ area reduction relative to commercial IPs, validating its practicality for technology-node-aware multiplier and MAC synthesis.
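The softmax substitution and LSE smoothing mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the arrival times, the logits, the smoothing parameter `gamma`, and the argmax-based `legalize` helper are all hypothetical values and names chosen for illustration. It only shows how a discrete choice can be relaxed into a probability vector, how a hard `max` over arrival times can be replaced by a smooth upper bound, and how a discrete choice might be recovered afterwards.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lse_max(values, gamma):
    """Smooth upper bound on max(values): gamma * log(sum(exp(v / gamma)))."""
    m = max(values)  # shift by the max for numerical stability
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))

def legalize(logits):
    """Assumed legalization rule: pick the index with the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical arrival times (ns) of three signals that could feed one compressor input.
arrivals = [0.10, 0.25, 0.18]
# Hypothetical learnable scores for assigning each signal to that input.
logits = [2.0, -1.0, 0.5]

probs = softmax(logits)  # continuous relaxation of a one-hot assignment
expected_arrival = sum(p * a for p, a in zip(probs, arrivals))
smooth_worst = lse_max(arrivals, gamma=0.01)  # differentiable stand-in for max(arrivals)
choice = legalize(logits)  # discrete assignment recovered from the relaxation

print(probs, expected_arrival, smooth_worst, choice)
```

A smaller `gamma` makes the LSE value track the true maximum more closely, at the cost of sharper gradients; the gap is at most `gamma * log(n)` for `n` inputs.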

Abstract

Multipliers and multiply-accumulators (MACs) are fundamental building blocks for compute-intensive applications such as artificial intelligence. With the diminishing returns of Moore's Law, optimizing multiplier performance now necessitates process-aware architectural innovations rather than relying solely on technology scaling. In this paper, we introduce DOMAC, a novel approach that employs differentiable optimization for designing multipliers and MACs at specific technology nodes. DOMAC establishes an analogy between optimizing multi-staged parallel compressor trees and training deep neural networks. Building on this insight, DOMAC reformulates the discrete optimization challenge into a continuous problem by incorporating differentiable timing and area objectives. This formulation enables us to utilize existing deep learning toolkits for highly efficient implementation of the differentiable solver. Experimental results demonstrate that DOMAC achieves significant enhancements in both performance and area efficiency compared to state-of-the-art baselines and commercial IPs in multiplier and MAC designs.

Paper Structure

This paper contains 22 sections, 13 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Architecture of (a) multipliers and (b) fused multiply-accumulators.
  • Figure 2: Example of adjusting compressor tree interconnection, where the partial products and compressor inputs can be assigned interchangeably.
  • Figure 3: Example of two different implementations of a 3:2 compressor using the Nangate45 Open Cell Library [nangate2008freepdk45]. We estimate the delay using a nonlinear delay model (NLDM) with $\text{islew} = 0.02\,\text{ns}$ and $\text{oload} = 3\,\text{fF}$ for all input-output pairs.
  • Figure 4: Pareto frontiers of the synthesized results on multipliers.
  • Figure 5: Pareto frontiers of the synthesized results on MACs.
  • ...and 1 more figures