General-Purpose Multicore Architectures

Saugata Ghose

TL;DR

This work analyzes the shift from ILP-driven single-core CPUs to multicore architectures, a shift driven by power density and transistor scaling limits. It surveys multicore microarchitecture, memory hierarchies, coherence protocols, and OS-level optimizations, clarifying how concurrency, cache sharing, and memory scheduling shape performance. It highlights key design trends such as DVFS, cache slicing, SoCs, heterogeneous cores, and chiplet-based designs, and discusses evaluation metrics that capture throughput, fairness, and energy efficiency. The chapter emphasizes that effective multicore CPUs rely on coordinated hardware and software strategies to manage memory interference, coherence, and heterogeneity. These strategies enable scalable performance in systems ranging from embedded devices to data centers, and designs continue to evolve toward modular, energy-aware architectures.

Abstract

The early 2000s marked an inflection point in computer architecture. While the number of transistors available on a chip continued to grow, crucial transistor scaling properties started to break down, resulting in increasing power consumption. At the same time, aggressive single-core performance optimizations were yielding diminishing returns due to inherent limits in instruction-level parallelism. This led to the rise of multicore CPU architectures, which are now commonplace in modern computers at all scales. In this chapter, we discuss the evolution of multicore CPUs since their introduction. Starting with a historical overview of multiprocessing, we explore the basic microarchitecture of a multicore CPU, key challenges resulting from shared memory resources, operating system modifications to optimize multicore CPU support, popular metrics for multicore evaluation, and recent trends in multicore CPU design.

Paper Structure

This paper contains 41 sections, 12 equations, and 15 figures.

Figures (15)

  • Figure 1: Log-linear plot of selected CPUs introduced between 1971 and 2024, illustrating the progression of Moore's Law.
  • Figure 2: An illustration of Flynn's taxonomy.
  • Figure 3: Amdahl's Law visualized for an example application.
  • Figure 4: Comparison of theoretical parallel speedup estimated by Amdahl's Law and Gustafson's Law, for different parallelizable fractions $f$. Inset graphs show a zoomed-in section of the main graph for clarity.
  • Figure 5: A synthetic example of observable behavior in a real-world parallel system. The dotted line shows ideal parallelism ($f=1$ with Amdahl's Law), the dashed line shows expected performance ($f=0.99$ with Amdahl's Law), and the solid line shows observed performance.
  • ...and 10 more figures
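
The captions for Figures 3 to 5 refer to Amdahl's Law and Gustafson's Law for a parallelizable fraction $f$. As a rough illustration, the sketch below evaluates the standard textbook forms of both laws. The function names and the sample values of `f` and `n` are assumptions made for this example and are not taken from the chapter. It also assumes that `f` is measured on the serial run for Amdahl's Law and on the scaled parallel run for Gustafson's Law.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Amdahl's Law (fixed problem size): 1 / ((1 - f) + f / n).

    f is the parallelizable fraction of the work, n the number of cores.
    """
    return 1.0 / ((1.0 - f) + f / n)


def gustafson_speedup(f: float, n: int) -> float:
    """Gustafson's Law (problem size scales with n): (1 - f) + f * n."""
    return (1.0 - f) + f * n


if __name__ == "__main__":
    # Hypothetical sample points, chosen only to show the trend.
    for f in (0.5, 0.9, 0.99, 1.0):
        for n in (1, 4, 16, 64):
            print(f"f={f:<4} n={n:<2} "
                  f"amdahl={amdahl_speedup(f, n):6.2f} "
                  f"gustafson={gustafson_speedup(f, n):6.2f}")
```

With $f=1$ both laws reduce to a linear speedup of `n`, which matches the ideal-parallelism line described for Figure 5. For any $f<1$, the Amdahl speedup is bounded above by $1/(1-f)$ no matter how many cores are added.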