UFO-MAC: A Unified Framework for Optimization of High-Performance Multipliers and Multiply-Accumulators

Dongsheng Zuo, Jiadong Zhu, Chenglin Li, Yuzhe Ma

TL;DR

UFO-MAC tackles the challenge of optimizing high-performance multipliers and MACs by unifying compressor-tree optimization with a non-uniform CPA arrival model. It uses ILP to optimize compressor assignment and interconnection order, and introduces a high-fidelity timing model (mpfo and FDC) to guide final adder optimization, including a fused MAC variant. Experimental results show UFO-MAC Pareto-dominates state-of-the-art baselines and commercial IP in area and delay, validated in real modules such as FIR filters and AI systolic arrays. The framework achieves substantial practical gains, demonstrating its effectiveness for AI accelerators and signal processing hardware. Future work includes extending the approach to floating-point datapaths and broader Processing Element (PE) array designs.

Abstract

Multipliers and multiply-accumulators (MACs) are critical arithmetic circuit components in the modern era. As essential components of AI accelerators, they significantly influence the area and performance of compute-intensive circuits. This paper presents UFO-MAC, a unified framework for the optimization of multipliers and MACs. Specifically, UFO-MAC employs an optimal compressor tree structure and utilizes integer linear programming (ILP) to refine the stage assignment and interconnection of the compressors. Additionally, it explicitly exploits the non-uniform arrival time profile of the carry propagate adder (CPA) within multipliers to achieve targeted optimization. Moreover, the framework also supports the optimization of fused MAC architectures. Experimental results demonstrate that multipliers and MACs optimized by UFO-MAC Pareto-dominate state-of-the-art baselines and commercial IP libraries. The performance gain of UFO-MAC is further validated through the implementation of multipliers and MACs within functional modules, underlining its efficacy in real scenarios.
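
The claim that one set of designs Pareto-dominates another can be made concrete with a small check. The sketch below is illustrative only: it assumes each design is summarized as an (area, delay) pair where smaller is better for both, and the helper names (`dominates`, `pareto_front`, `front_dominates`) and the numbers are hypothetical, not taken from the paper.

```python
from typing import List, Tuple

# Assumption: a design is summarized as (area, delay); smaller is better for both.
Design = Tuple[float, float]


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b in both metrics and strictly better in at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points: List[Design]) -> List[Design]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def front_dominates(ours: List[Design], theirs: List[Design]) -> bool:
    """True if every point in `theirs` is dominated by some point in `ours`."""
    return all(any(dominates(o, t) for o in ours) for t in theirs)


# Hypothetical numbers for illustration only; these are not results from the paper.
ours = [(100.0, 1.0), (120.0, 0.8)]
baseline = [(110.0, 1.1), (130.0, 0.9)]
print(front_dominates(ours, baseline))
```

Under this definition, a method Pareto-dominates a baseline when, for every baseline design point, it offers some design that is at least as good in both area and delay and strictly better in one of them.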

Paper Structure

This paper contains 20 sections, 27 equations, 13 figures, 2 tables, 2 algorithms.

Figures (13)

  • Figure 1: Motivating example: The optimization of the CT and that of the CPA are not decoupled; the CPA exhibits a non-uniform arrival time profile, requiring optimization strategies different from those of traditional adder design methodology.
  • Figure 2: Multiplier Architecture
  • Figure 3: Fused MAC Architecture
  • Figure 4: Critical path delay distribution of 10,000 random interconnection orders for the same CT stage structure.
  • Figure 5: UFO-MAC framework. The framework first generates optimal CT structures and then performs timing-driven optimizations on the CPA based on a non-uniform arrival profile to achieve area-delay efficiency.
  • ...and 8 more figures