Mozart: A Chiplet Ecosystem-Accelerator Codesign Framework for Composable Bespoke Application Specific Integrated Circuits
Haoran Jin, Jirong Yang, Yunpeng Liu, Barry Lyu, Kangqi Zhang, Nathaniel Bleier
TL;DR
Mozart tackles the heterogeneity of modern AI workloads by moving from monolithic accelerators to a chiplet-based ecosystem that supports operator-level disaggregation, memory specialization, and tensor-level optimizations. It introduces a constraint-aware codesign framework with a four-layer optimization pipeline (simulated annealing, evolutionary search, a modified convex hull trick, and place-and-route) to build composable bespoke application-specific integrated circuits (BASICs) while modeling energy, performance, and cost. With a compact set of 8 chiplets chosen to balance non-recurring engineering (NRE) cost against performance, the approach reduces energy by 43.5% and energy-cost product by 25.4% relative to homogeneous accelerators, and raises speculative-decoding throughput by 24.6-58.6%. Case studies in datacenter LLM serving and edge autonomous-vehicle perception demonstrate practical impact, enabling deployment of heterogeneous, high-efficiency AI accelerators in both cloud and edge contexts. The authors plan an open-source release.
Abstract
Modern AI acceleration faces a fundamental challenge: conventional assumptions about memory requirements, batching effectiveness, and latency-throughput tradeoffs are systemwide generalizations that ignore the heterogeneous computational patterns of individual neural network operators. However, moving toward network-level customization and operator-level heterogeneity incurs substantial non-recurring engineering (NRE) costs. While chiplet-based approaches have been proposed to amortize NRE costs, reuse opportunities remain limited unless the chiplets that are truly necessary are carefully identified. This paper introduces Mozart, a chiplet ecosystem and accelerator codesign framework that systematically constructs low-cost bespoke application-specific integrated circuits (BASICs). BASICs leverage operator-level disaggregation to explore chiplet and memory heterogeneity, tensor fusion, and tensor parallelism, with place-and-route validation ensuring physical implementability. The framework also enables constraint-aware system-level optimization across deployment contexts ranging from datacenter inference serving to edge computing in autonomous vehicles. The evaluation confirms that with just 8 strategically selected chiplets, Mozart-generated composite BASICs achieve 43.5%, 25.4%, 67.7%, and 78.8% reductions in energy, energy-cost product, energy-delay product (EDP), and energy-delay-cost product, respectively, compared to traditional homogeneous accelerators. For datacenter LLM serving, Mozart achieves a 15-19% energy reduction and a 35-39% energy-cost improvement. In speculative decoding, Mozart delivers throughput improvements of 24.6-58.6% while reducing energy consumption by 38.6-45.6%. For autonomous vehicle perception, Mozart reduces energy-cost by 25.54% and energy by 10.53% under real-time constraints.
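The composite figures of merit named above are conventionally formed by multiplying their component terms. The sketch below illustrates that arithmetic and how a percentage reduction against a baseline is computed. It is a minimal sketch, not the paper's evaluation code: the `DesignPoint` class, the `reduction_pct` helper, and every number in it are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    """Hypothetical accelerator design point (illustrative units only)."""
    energy_j: float   # energy per inference, joules
    delay_s: float    # latency per inference, seconds
    cost_usd: float   # amortized hardware cost, dollars

    def energy_cost(self) -> float:
        # Energy-cost product.
        return self.energy_j * self.cost_usd

    def edp(self) -> float:
        # Energy-delay product (EDP).
        return self.energy_j * self.delay_s

    def edcp(self) -> float:
        # Energy-delay-cost product.
        return self.energy_j * self.delay_s * self.cost_usd


def reduction_pct(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (1.0 - candidate / baseline)


# Made-up values, NOT results from the paper.
baseline = DesignPoint(energy_j=10.0, delay_s=2.0, cost_usd=100.0)
candidate = DesignPoint(energy_j=6.0, delay_s=1.5, cost_usd=120.0)

report = {
    "energy": reduction_pct(baseline.energy_j, candidate.energy_j),
    "energy_cost": reduction_pct(baseline.energy_cost(), candidate.energy_cost()),
    "edp": reduction_pct(baseline.edp(), candidate.edp()),
    "edcp": reduction_pct(baseline.edcp(), candidate.edcp()),
}

for name, pct in report.items():
    print(f"{name}: {pct:.1f}% reduction")
```

In this toy example the candidate costs more than the baseline, so its energy-cost reduction is smaller than its energy reduction. This is one way an energy saving can exceed the corresponding energy-cost saving, as it does in the headline figures above.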
