ORBIT: Oak Ridge Base Foundation Model for Earth System Predictability

Xiao Wang, Siyan Liu, Aristeidis Tsaris, Jong-Youl Choi, Ashwin Aji, Ming Fan, Wei Zhang, Junqi Yin, Moetasim Ashfaq, Dan Lu, Prasanna Balaprakash

TL;DR

ORBIT addresses Earth system predictability by scaling a vision-transformer foundation model to 113B parameters and integrating 91 climate-variable channels from CMIP6 data. It introduces Hybrid Sharded Tensor-Data Orthogonal Parallelism (Hybrid-STOP) to enable architecture-agnostic, exascale training on Frontier, with sustained throughput of 684 PFLOPS to 1.6 EFLOPS. Pretraining on CMIP6 followed by fine-tuning on ERA5 improves data efficiency and long-lead forecasting accuracy: the model outperforms state-of-the-art baselines at 14- and 30-day horizons and approaches or exceeds them at shorter leads. For climate science and HPC, the work provides a scalable, hardware-inclusive blueprint for cross-domain AI applications on diverse hardware ecosystems.

Abstract

Earth system predictability is challenged by the complexity of environmental dynamics and the multitude of variables involved. Current AI foundation models, although advanced by leveraging large and heterogeneous data, are often constrained by their size and data integration, which limits their effectiveness across the full range of Earth system prediction challenges. To overcome these limitations, we introduce the Oak Ridge Base Foundation Model for Earth System Predictability (ORBIT), an advanced vision transformer model that scales up to 113 billion parameters using a novel hybrid tensor-data orthogonal parallelism technique. As the largest model of its kind, ORBIT is a thousand times larger than current climate AI foundation models. Performance scaling tests on the Frontier supercomputer demonstrate that ORBIT achieves 684 petaFLOPS to 1.6 exaFLOPS of sustained throughput, with scaling efficiency maintained at 41% to 85% across 49,152 AMD GPUs. These results advance AI-driven climate modeling and show promise for significantly improving Earth system predictability.
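
As a rough sanity check on the throughput figures above, the short sketch below converts the aggregate numbers into an average per-GPU rate. It assumes that both the 684 petaFLOPS and the 1.6 exaFLOPS figures were measured on the full 49,152-GPU run, which the abstract suggests but does not state outright; the function name `per_gpu_tflops` is illustrative only.

```python
def per_gpu_tflops(total_flops_per_s: float, num_gpus: int) -> float:
    """Average sustained throughput per GPU, in teraFLOPS."""
    return total_flops_per_s / num_gpus / 1e12


# Figures quoted in the abstract.
NUM_GPUS = 49_152
LOW_THROUGHPUT = 684e15   # 684 petaFLOPS
HIGH_THROUGHPUT = 1.6e18  # 1.6 exaFLOPS

# Assumption: both figures refer to the full 49,152-GPU run.
low_per_gpu = per_gpu_tflops(LOW_THROUGHPUT, NUM_GPUS)
high_per_gpu = per_gpu_tflops(HIGH_THROUGHPUT, NUM_GPUS)

print(f"{low_per_gpu:.1f} to {high_per_gpu:.1f} TFLOPS per GPU")
```

Under that assumption the aggregate range corresponds to roughly 14 to 33 teraFLOPS per GPU on average.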

Paper Structure (15 sections, 3 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Architecture of the ClimaX foundation model.
  • Figure 2: Fully Sharded Data Parallelism (FSDP) forward and backward pass.
  • Figure 3: Hybrid-STOP forward and backward pass. GPUs 1 and 2 are an example FSDP group, highlighted by a red rectangular box. GPUs 1 and 3 are an example tensor-parallel group, highlighted by purple dashed lines.
  • Figure 4: Hierarchical parallelism of Hybrid-STOP. Each horizontal purple rectangle represents a tensor-parallel group. Vertical red rectangles represent FSDP groups. Green rectangles represent DDP groups.
  • Figure 5: The maximal model size that each parallelism can scale to at different numbers of GPUs.
  • ...and 5 more figures
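
The captions for Figures 3 and 4 describe three nested kinds of process groups: tensor-parallel, FSDP, and DDP. The sketch below shows one way GPU ranks could be partitioned into such groups. It is not the authors' implementation: the rank ordering (FSDP ranks contiguous, tensor-parallel ranks strided, DDP replicas outermost) is an assumption chosen only so that the four-GPU case reproduces the example in the Figure 3 caption, and the function name `build_groups` is hypothetical. Ranks are 0-indexed here, so "GPUs 1 and 2" in the caption correspond to ranks 0 and 1.

```python
from typing import Dict, List


def build_groups(ddp_size: int, tp_size: int, fsdp_size: int) -> Dict[str, List[List[int]]]:
    """Partition ddp_size * tp_size * fsdp_size ranks into three kinds of groups.

    Assumed layout (illustrative only): rank = d * (tp_size * fsdp_size) + t * fsdp_size + f,
    where d, t, f index the DDP replica, tensor-parallel shard, and FSDP shard.
    """
    block = tp_size * fsdp_size  # number of ranks holding one full model replica

    def rank(d: int, t: int, f: int) -> int:
        return d * block + t * fsdp_size + f

    # FSDP group: same replica and tensor shard, varying FSDP index (contiguous ranks).
    fsdp = [[rank(d, t, f) for f in range(fsdp_size)]
            for d in range(ddp_size) for t in range(tp_size)]
    # Tensor-parallel group: same replica and FSDP index, varying tensor shard (strided ranks).
    tp = [[rank(d, t, f) for t in range(tp_size)]
          for d in range(ddp_size) for f in range(fsdp_size)]
    # DDP group: same position inside the replica, varying replica index.
    ddp = [[rank(d, t, f) for d in range(ddp_size)]
           for t in range(tp_size) for f in range(fsdp_size)]
    return {"fsdp": fsdp, "tp": tp, "ddp": ddp}


# Four GPUs, one replica: mirrors the Figure 3 caption example.
four_gpu = build_groups(ddp_size=1, tp_size=2, fsdp_size=2)
print(four_gpu["fsdp"])  # [[0, 1], [2, 3]]
print(four_gpu["tp"])    # [[0, 2], [1, 3]]
```

In this layout every rank belongs to exactly one group of each kind, which is what makes the tensor-parallel and FSDP dimensions orthogonal to each other.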