MuchiSim: A Simulation Framework for Design Exploration of Multi-Chip Manycore Systems

Marcelo Orenes-Vera; Esin Tureci; Margaret Martonosi; David Wentzlaff

TL;DR

MuchiSim addresses the challenge of design-space exploration for scale-out, communication-intensive multi-chiplet manycore systems. It is a scalable, parallel simulator that runs on the host and models data movement and network traffic cycle by cycle, while also reporting energy, area, and cost estimates. The framework supports various tile-based architectures, memory hierarchies, and inter-chip interconnects, along with multiple parallelization strategies and a benchmark suite for assessing performance across diverse DCI workloads. Key contributions include a distributed manycore performance model, energy, area, and cost models for multi-chip modules and interposers, and visualization tools that aid comparative analysis. The simulator is validated against real hardware and demonstrated at large scales. The results show that simulation time decreases linearly or near-linearly with the number of host threads, and a case study yields actionable insights for tuning memory, compute, and network resources, making MuchiSim a practical open-source tool for architecture research and design optimization.

Abstract

The design space exploration of scaled-out manycores for communication-intensive applications (e.g., graph analytics and sparse linear algebra) is hampered because existing frameworks lack either the scalability or the accuracy needed to simulate data-dependent execution patterns. This paper presents MuchiSim, a novel parallel simulator designed to address these challenges when exploring the design space of distributed multi-chiplet manycore architectures. We evaluate MuchiSim on simulated systems with up to a million interconnected processing units (PUs) while modeling data movement and communication cycle by cycle. In addition to performance, MuchiSim reports the energy, area, and cost of the simulated system. It also comes with a benchmark application suite and two data visualization tools. MuchiSim supports various parallelization strategies and communication primitives, such as task-based parallelization and message passing, making it highly relevant for architectures with software-managed coherence and distributed memory. Via a case study, we show that MuchiSim helps users explore the balance between memory and computation units and the constraints related to chiplet integration and inter-chip communication. MuchiSim enables evaluating new techniques or design parameters for systems at scales that are more realistic for modern parallel systems, opening the door to further research in this area.

Paper Structure (18 sections, 5 figures, 1 table)

Introduction
Background and Motivation
Full-system vs. Application-level Simulation
Simulating DCI Applications on Manycores
Scope of Applicability
MuchiSim Simulation Framework
The Target Architecture Class
Describing and Mapping an Application
Simulating the Application Runtime
Energy and Area Model
Cost Model
Visualization Tools
Benchmark Suite
Results
Validation
...and 3 more sections

Figures (5)

Figure 1: Hierarchical overview of how tiles can be organized on MuchiSim's target architecture. The board of a cluster node may contain one or multiple chip packages, each with one or multiple chiplets. Packages can be composed of only compute chiplets (with a grid of tiles), or also DRAM chiplets (adjacent to the compute chiplets). Chiplets also include the physical layer (PHY) for inter-chiplet communication. A tile contains one or more processing units (PUs), a private local memory (PLM), a network router (R), and a task scheduling unit (TSU). The task queues are mapped into the PLM.
Figure 2: Animation of the router activity when running BFS on RMAT-22 for three different NoCs: 2D mesh (top), 2D torus (middle), and 2D torus with reduction trees (bottom). The left panels show the routing activity and the right panels show the PU (core) activity. No router activity can mean that the router has no messages to route, or that the NoC is clogged and messages are stuck. The animation is composed of snapshots at a rate of a frame every 40 microseconds (this rate is configurable in MuchiSim). Since this rate is the same for these plots, the number of frames (50, 28, and 16, from top to bottom) is proportional to the execution time. The animation can be visualized by opening this PDF with Adobe. Visualizing the router and core activity simultaneously helps understand the effect of NoC congestion on core utilization. In addition, plotting the destination-port collisions helps understand the router activity. We evaluated the version of BFS with barrier synchronization at the end of each epoch to showcase the effect of the network on the tail execution time (3 major epochs can be observed). A finer time resolution allows observing the evolution of the execution in more detail, but it increases the size of the GIF.
Figure 3: Ratio between the simulator and the DUT runtime, for two DUT sizes (32x32 and 64x64 tiles, monolithic, connected via a 64-bit 2D torus), evaluated with an increasing number of host threads to process the same RMAT-22 dataset. The DUT time is considered as the aggregated runtime of all tiles. The simulator runtime decreases close to linearly with the number of host threads.
Figure 4: Simulation time (in host seconds) and throughput in DUT operations and NoC message flits routed per second (y-axis), for DUT sizes scaling from a thousand to a million tiles (x-axis) when processing the RMAT-26 dataset. (This evaluation models 32x32 tiles per chiplet, connected via a 64-bit hierarchical 2D torus.) The $2^{10}$ and $2^{12}$ datapoints are evaluated with 16 and 32 host threads, respectively, of a single-socket Intel Xeon Gold 6342 at 2.8 GHz. The datapoints from $2^{14}$ to $2^{18}$ are evaluated with 64 threads, and the $2^{20}$ datapoint with 128 threads, of a 4-socket Intel Xeon Gold 6230 at 2.1 GHz.
Figure 5: Performance, energy efficiency, and performance per dollar improvements of the DUT using different SRAM sizes and number of tiles per HBM channel, over a baseline of 64 KiB SRAM and 128 tiles per HBM channel (Tile/Ch). In this study, a chiplet is always attached to a single 8-channel HBM device, and thus, the number of tiles per chiplet (16x16 or 32x32) determines the ratio of tiles per HBM channel. The RMAT-25 dataset is studied on a DUT with 1024 tiles; the dataset footprint per tile ranges from 4 MiB (Histogram) to 8 MiB per tile (SPMV).
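
Some of the numbers in the captions above follow from simple arithmetic, and a short Python sketch may make them easier to check. It is not part of MuchiSim: the function and variable names are hypothetical and chosen only for illustration. The first helper converts the frame counts from the Figure 2 caption into approximate execution times, assuming every frame covers a full 40-microsecond period (the last frame may be partial, so these are rough estimates). The second derives the tiles-per-HBM-channel ratio of Figure 5 from the chiplet dimensions and the 8-channel HBM device stated in that caption.

```python
# Illustrative arithmetic only; names are hypothetical, not MuchiSim APIs.

FRAME_PERIOD_US = 40   # snapshot period from the Figure 2 caption
HBM_CHANNELS = 8       # channels per HBM device from the Figure 5 caption


def approx_runtime_us(num_frames, frame_period_us=FRAME_PERIOD_US):
    """Approximate execution time, assuming each frame spans a full period."""
    return num_frames * frame_period_us


def tiles_per_hbm_channel(chiplet_side, hbm_channels=HBM_CHANNELS):
    """Tiles sharing one HBM channel for a square chiplet of the given side."""
    return (chiplet_side * chiplet_side) // hbm_channels


# Frame counts reported in the Figure 2 caption, top to bottom.
frames = {"2D mesh": 50, "2D torus": 28, "2D torus + reduction trees": 16}
runtimes = {noc: approx_runtime_us(n) for noc, n in frames.items()}

# Chiplet sizes studied in Figure 5 (16x16 and 32x32 tiles).
ratios = {side: tiles_per_hbm_channel(side) for side in (16, 32)}

for noc, t in runtimes.items():
    print(f"{noc}: ~{t} us")
for side, r in ratios.items():
    print(f"{side}x{side} chiplet: {r} tiles per HBM channel")
```

Under these assumptions, the 32x32 chiplet gives 128 tiles per channel, which matches the baseline named in the Figure 5 caption, and the 16x16 chiplet gives 32.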
