Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10

Yiqi Liu; Yuqi Xue; Yu Cheng; Lingxiao Ma; Ziming Miao; Jilong Xue; Jian Huang

TL;DR

The paper tackles memory bandwidth and on-chip memory challenges in inter-core connected AI chips by introducing T10, a DL compiler that uses a compute-shift execution model and the rTensor abstraction to coordinate data movement across thousands of cores. T10 employs a two-stage optimization: intra-operator Pareto-optimal plans to balance time and memory, and inter-operator memory reconciliation to optimize end-to-end execution on distributed on-chip memory. It provides a cost-aware optimization framework, an extensible hardware-mapping interface, and an IPU-focused implementation, achieving up to 3.3x end-to-end speedups over state-of-the-art compilers and enabling larger models and LLM workloads. The results demonstrate significant improvements in on-chip data reuse, reduced inter-core communication, and scalable performance on Graphcore IPUs, highlighting the practical potential of compute-shift-based compilation for future inter-core AI accelerators.
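The time-versus-memory balance behind the intra-operator stage can be illustrated with a generic Pareto filter. This is a minimal sketch and not T10's actual search: the `Plan` record, the helper names, and all numbers are hypothetical and chosen only for illustration. It assumes each candidate plan has already been assigned an estimated execution time and a per-core memory footprint.

```python
from typing import NamedTuple, Optional


class Plan(NamedTuple):
    # Hypothetical record for one candidate execution plan of an operator.
    name: str
    time_us: float   # estimated execution time (assumed cost-model output)
    mem_bytes: int   # estimated per-core memory footprint


def pareto_front(plans: list[Plan]) -> list[Plan]:
    """Keep only plans that no other plan matches or beats on both time and memory."""
    front: list[Plan] = []
    best_mem = float("inf")
    # Visit plans from fastest to slowest; a slower plan is worth keeping
    # only if it uses strictly less memory than every faster plan.
    for plan in sorted(plans, key=lambda p: (p.time_us, p.mem_bytes)):
        if plan.mem_bytes < best_mem:
            front.append(plan)
            best_mem = plan.mem_bytes
    return front


def fastest_within(front: list[Plan], budget_bytes: int) -> Optional[Plan]:
    """Pick the fastest plan on the front that fits a per-core memory budget."""
    for plan in front:  # front is ordered by increasing time
        if plan.mem_bytes <= budget_bytes:
            return plan
    return None


candidates = [
    Plan("a", 10.0, 100),
    Plan("b", 20.0, 50),
    Plan("c", 25.0, 80),   # slower and larger than "b", so dominated
    Plan("d", 30.0, 20),
]
front = pareto_front(candidates)
choice = fastest_within(front, budget_bytes=60)
```

Keeping only the front means a later stage that assigns memory budgets across operators needs to consider far fewer plans per operator, since any dominated plan can be replaced by one that is no worse on either axis.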

Abstract

As AI chips incorporate numerous parallel cores to scale deep learning (DL) computing, inter-core communication has recently been enabled through high-bandwidth, low-latency on-chip interconnect links (e.g., on the Graphcore IPU). This lets each core directly access the fast scratchpad memory of other cores, which enables new parallel computing paradigms. However, current DL compilers lack proper support for scalable inter-core connections, so it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and reduces unnecessary inter-core communication. Our evaluation on a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3× performance improvement and support for scaling to larger models, compared to state-of-the-art DL compilers and vendor libraries.
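The compute-shift pattern described in the abstract can be simulated on a single machine. The sketch below is an illustration under stated assumptions, not T10's implementation or the rTensor API: cores are modeled as list slots, the ring shift is a list rotation, and the function names are hypothetical. It assumes the row count of A and the column count of B are both divisible by the number of cores. Each simulated core keeps one row block of A, computes against the column block of B it currently holds, then passes that block to its neighbor.

```python
def matmul(a, b):
    """Dense matmul on nested lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_rows(m, parts):
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]


def split_cols(m, parts):
    step = len(m[0]) // parts
    return [[row[j * step:(j + 1) * step] for row in m] for j in range(parts)]


def compute_shift_matmul(a, b, num_cores):
    """Simulate C = A @ B with a ring of cores (illustrative only)."""
    if len(a) % num_cores or len(b[0]) % num_cores:
        raise ValueError("rows of A and columns of B must divide evenly across cores")

    a_parts = split_rows(a, num_cores)   # core c permanently holds a_parts[c]
    local_b = split_cols(b, num_cores)   # B partition currently resident on each core
    held = list(range(num_cores))        # index of the B partition each core holds
    out = [[None] * num_cores for _ in range(num_cores)]

    for _ in range(num_cores):
        # Compute: every core multiplies its A block by its resident B block.
        for c in range(num_cores):
            out[c][held[c]] = matmul(a_parts[c], local_b[c])
        # Shift: every core hands its B block to the next core on the ring.
        local_b = local_b[-1:] + local_b[:-1]
        held = held[-1:] + held[:-1]

    # Stitch each core's output blocks back into full rows of C.
    result = []
    for c in range(num_cores):
        for r in range(len(a_parts[c])):
            result.append([v for j in range(num_cores) for v in out[c][j][r]])
    return result


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
B = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1]]
C = compute_shift_matmul(A, B, num_cores=2)
```

In this sketch each core stores only a 1/P slice of B at any time, at the cost of P shift steps; keeping larger or replicated slices per core would trade memory for fewer shifts, which is the kind of trade-off the abstract refers to.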

Paper Structure (25 sections, 2 equations, 24 figures, 3 tables, 1 algorithm)

Introduction
Background and Motivation
Inter-core Connected Intelligence Processor
Inefficiency of Existing Approaches
Core Idea of T10
System Design of T10
rTensor: A New Tensor Abstraction
Compute-Shift Execution Plan
Intra-operator and Inter-operator Trade-off
Searching Pareto-optimal Intra-operator Plans
Holistic Inter-operator Memory Reconciliation
Mapping to the Hardware Accelerator
Implementation Details
Evaluation
Experimental Setup
...and 10 more sections

Figures (24)

Figure 1: System architecture of TPU (left), GPU (middle), and IPU (right) chips.
Figure 2: A comparison of the conventional load-compute-store (a) vs. our compute-shift (c) style execution. (b) shows the per-core memory footprint of representative operators when running DNN models on IPU using VGM. Ratio is the potential increase in sub-operator size by removing VGM. The result of OPT13B opt comes from profiling one of its layers on IPU.
Figure 3: An example that maps a MatMul operator to two cores with the compute-shift style execution. Both (b) and (c) are valid compute-shift execution plans, but with different tradeoffs between memory footprint and communication overhead.
Figure 4: System overview of T10.
Figure 5: rTensor abstraction in T10.
...and 19 more figures
