Mozart: A Chiplet Ecosystem-Accelerator Codesign Framework for Composable Bespoke Application Specific Integrated Circuits

Haoran Jin, Jirong Yang, Yunpeng Liu, Barry Lyu, Kangqi Zhang, Nathaniel Bleier

TL;DR

Mozart addresses the heterogeneity of modern AI workloads by moving from monolithic accelerators to a chiplet-based ecosystem that supports operator-level disaggregation, memory specialization, and tensor-level optimizations. It introduces a constraint-aware codesign framework with a four-layer optimization pipeline (simulated annealing, evolutionary search, a modified convex hull trick, and place-and-route) to build composable BASICs while modeling energy, performance, and cost. With a compact pool of 8 chiplets, chosen to balance NRE cost against performance, composite BASICs reduce energy by 43.5% and energy-cost product by 25.4% relative to homogeneous accelerators, and speculative decoding throughput improves by 24.6-58.6%. Case studies in datacenter LLM serving and edge autonomous-vehicle perception show that the approach applies in both cloud and edge contexts, and the authors plan an open-source release.

Abstract

Modern AI acceleration faces a fundamental challenge: conventional assumptions about memory requirements, batching effectiveness, and latency-throughput tradeoffs are system-wide generalizations that ignore the heterogeneous computational patterns of individual neural network operators. However, moving toward network-level customization and operator-level heterogeneity incurs substantial Non-Recurring Engineering (NRE) costs. While chiplet-based approaches have been proposed to amortize NRE costs, reuse opportunities remain limited without carefully identifying which chiplets are truly necessary. This paper introduces Mozart, a chiplet ecosystem and accelerator codesign framework that systematically constructs low-cost bespoke application-specific integrated circuits (BASICs). BASICs leverage operator-level disaggregation to explore chiplet and memory heterogeneity, tensor fusion, and tensor parallelism, with place-and-route validation ensuring physical implementability. The framework also enables constraint-aware system-level optimization across deployment contexts ranging from datacenter inference serving to edge computing in autonomous vehicles. The evaluation shows that with just 8 strategically selected chiplets, Mozart-generated composite BASICs achieve 43.5%, 25.4%, 67.7%, and 78.8% reductions in energy, energy-cost product, energy-delay product (EDP), and energy-delay-cost product, respectively, compared to traditional homogeneous accelerators. For datacenter LLM serving, Mozart achieves 15-19% energy reduction and 35-39% energy-cost improvement. In speculative decoding, Mozart delivers throughput improvements of 24.6-58.6% while reducing energy consumption by 38.6-45.6%. For autonomous vehicle perception, Mozart reduces energy-cost by 25.54% and energy by 10.53% under real-time constraints.
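
The composite metrics named in the abstract are products of energy, delay, and cost, and each reported figure is a percent reduction against a baseline. The following Python sketch shows how such numbers could be computed. The `DesignPoint` class, the `reduction_pct` helper, and all numeric values are hypothetical placeholders chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    """One accelerator design. All field values used below are made up."""
    energy_j: float   # energy per inference, in joules
    delay_s: float    # latency per inference, in seconds
    cost_usd: float   # per-unit cost, in dollars

    def energy_cost(self) -> float:
        """Energy-cost product."""
        return self.energy_j * self.cost_usd

    def edp(self) -> float:
        """Energy-delay product (EDP)."""
        return self.energy_j * self.delay_s

    def edcp(self) -> float:
        """Energy-delay-cost product."""
        return self.energy_j * self.delay_s * self.cost_usd


def reduction_pct(baseline: float, candidate: float) -> float:
    """Percent reduction of `candidate` relative to `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# Hypothetical designs: the candidate costs more per unit but uses
# less energy and finishes sooner.
baseline = DesignPoint(energy_j=2.0, delay_s=0.010, cost_usd=100.0)
candidate = DesignPoint(energy_j=1.0, delay_s=0.008, cost_usd=150.0)

report = {
    "energy": reduction_pct(baseline.energy_j, candidate.energy_j),
    "energy_cost": reduction_pct(baseline.energy_cost(), candidate.energy_cost()),
    "edp": reduction_pct(baseline.edp(), candidate.edp()),
    "edcp": reduction_pct(baseline.edcp(), candidate.edcp()),
}

for name, pct in report.items():
    print(f"{name}: {pct:.1f}% reduction")
```

Because the metrics are products, a design can win on energy-cost product even when its unit cost is higher, as long as the energy saving outweighs the cost increase. In this made-up example the candidate is 50% more expensive yet still lowers the energy-cost product.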

Paper Structure

This paper contains 25 sections, 3 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Ideal neural network accelerators can support heterogeneous workloads, while still being flexible enough to support emerging workloads. They can be deployed to support various resource constrained applications. They can be designed and manufactured at low cost.
  • Figure 2: Heterogeneous memory systems enable significant cost optimization without performance degradation. Moving from homogeneous HBM3E to strategic combinations of HBM3E, GDDR7, and DDR5 maintains identical latency performance while achieving memory cost reductions of 25.4-96.7% across CNN and GPT models through operator-specific memory allocation based on compute vs. memory-bound classifications. Memory costs are from [wikipedia_hbm, wikipedia_lpddr, samsung_k4z80325bc_datasheet, jedec_hbm3_2022].
  • Figure 3: Batch scaling behavior varies dramatically across LLM operations, revealing operator-level heterogeneity that contradicts system-wide batching assumptions. Batching curves correspond to throughput scaling (right axis) while layer curves show latency scaling (left axis).
  • Figure 4: Architecture template of Mozart, showing support for stall-free pipeline execution and token passing for memory access arbitration.
  • Figure 5: Mozart's four-layer hierarchical framework: simulated annealing for chiplet pool composition, genetic algorithm for tensor fusion and buffer configuration, modified convex hull for chiplet selection, and place-and-route for physical implementation.
  • ...and 7 more figures
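
The caption of Figure 2 refers to classifying operators as compute-bound or memory-bound. One common way to make that distinction is a roofline-style comparison of an operator's arithmetic intensity against the hardware's ratio of peak compute to memory bandwidth. The Python sketch below illustrates that general technique only; it is not the paper's classification procedure, and the function names, `PEAK_FLOPS`, and `MEM_BANDWIDTH` are hypothetical placeholders.

```python
def matmul_cost(m: int, n: int, k: int, bytes_per_elem: int = 2):
    """FLOPs and bytes moved for an (m x k) @ (k x n) matmul.

    Assumes each operand and the output cross the memory interface once.
    """
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops, bytes_moved


def classify_operator(flops: float, bytes_moved: float,
                      peak_flops: float, mem_bandwidth: float) -> str:
    """Roofline test: compare arithmetic intensity with machine balance."""
    intensity = flops / bytes_moved        # FLOPs per byte for the operator
    balance = peak_flops / mem_bandwidth   # FLOPs per byte at the ridge point
    return "compute-bound" if intensity >= balance else "memory-bound"


def min_bandwidth_without_slowdown(flops: float, bytes_moved: float,
                                   peak_flops: float) -> float:
    """Lowest bandwidth (bytes/s) at which the operator stays compute-bound."""
    return bytes_moved * peak_flops / flops


# Hypothetical hardware numbers, for illustration only.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
MEM_BANDWIDTH = 1e12     # 1 TB/s

gemm = matmul_cost(1024, 1024, 1024)   # large square matmul
gemv = matmul_cost(1, 1024, 1024)      # single-row (batch-1) matmul

gemm_class = classify_operator(*gemm, PEAK_FLOPS, MEM_BANDWIDTH)
gemv_class = classify_operator(*gemv, PEAK_FLOPS, MEM_BANDWIDTH)
gemm_min_bw = min_bandwidth_without_slowdown(*gemm, PEAK_FLOPS)

print("GEMM:", gemm_class)
print("GEMV:", gemv_class)
print(f"GEMM stays compute-bound down to {gemm_min_bw / 1e9:.0f} GB/s")
```

Under these assumed numbers the large matmul remains compute-bound at a bandwidth well below the assumed 1 TB/s, which is the kind of headroom that would let a compute-bound operator use a cheaper memory without a latency penalty.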