Trillion Parameter AI Serving Infrastructure for Scientific Discovery: A Survey and Vision

Nathaniel Hudson, J. Gregory Pauloski, Matt Baughman, Alok Kamatar, Mansi Sakarvadia, Logan Ward, Ryan Chard, André Bauer, Maksim Levental, Wenyi Wang, Will Engler, Owen Price Skelly, Ben Blaiszik, Rick Stevens, Kyle Chard, Ian Foster

TL;DR

The paper argues that realizing scientific discovery powered by Trillion Parameter Models (TPMs) requires a dedicated serving ecosystem that scales to exascale data and compute. It articulates a vision for distributed, interface-rich TPMs hosted by research computing centers, with multiple inference modes (Query Only, Customized Embeddings, Intermediary Accesses) to support a wide range of scientific workflows. It surveys historical context and outlines challenges across access control, resource management, knowledge updating, interoperability, and efficient inference, and it proposes a comprehensive software stack and governance model. The work highlights practical steps (APIs, versioned TPMs, embedding pipelines, and hookable internals) that would enable reproducible, collaborative, and flexible use of TPMs in science, ultimately accelerating discovery and innovation.

Abstract

Deep learning methods are transforming research, enabling new techniques, and ultimately leading to new discoveries. As the demand for more capable AI models continues to grow, we are now entering an era of Trillion Parameter Models (TPMs), or models with more than a trillion parameters, such as Huawei's PanGu-$\Sigma$. We describe a vision for the ecosystem of TPM users and providers that caters to the specific needs of the scientific community. We then outline the significant technical challenges and open problems in system design for serving TPMs to enable scientific research and discovery. Specifically, we describe the requirements of a comprehensive software stack and interfaces to support the diverse and flexible requirements of researchers.

Paper Structure

This paper contains 27 sections and 3 figures.

Figures (3)

  • Figure 1: (a) Size of AI language models over the past few years. The gray line is the trend line, computed as a rolling average. The top-right corner features 3 TPMs: Switch-C Transformer [fedus2022switch], PanGu-$\Sigma$ [ren2023pangusigma], and (rumored) GPT-4 [gpt4]. (b) The benchmarked AI-PetaFLOPS of state-of-the-art HPC systems over time. Note: An asterisk denotes anticipated benchmark numbers.
  • Figure 2: The three envisioned modes of inference. Boxes represent the modular components that go into the respective serving modes; the trapezoids indicate the model portions for ingesting data and generating outputs; and the arrows denote communication between the modules.
  • Figure 3: Interpreting the residual stream with a lens framework that can directly give intermediate outputs from individual attention heads in GPT-like LLMs. This figure uses input, intermediate, and output values from GPT2-Large.