Comparative Study of Large Language Model Architectures on Frontier

Junqi Yin; Avishek Bose; Guojing Cong; Isaac Lyngaas; Quentin Anthony

TL;DR

This work tackles the lack of controlled, architecture-level comparisons for open-source GPT variants in scientific domains by pretraining and evaluating materials-science foundation models (MatGPT) on Frontier. It conducts a rigorous, end-to-end comparison of GPT-NeoX and LLaMA architectures using a domain-specific corpus totaling 15B tokens, exploring tokenization, architecture search, and HPC optimizations (including flash attention and LAMB) on AMD GPUs. The study demonstrates near-parity in general benchmarks between the two architectures, while showing substantial gains in energy efficiency and scalability on HPC hardware; importantly, embedding-derived features from MatGPT combined with GNNs achieve state-of-the-art band-gap prediction on a materials science task. The results provide practical guidance for deploying LLMs on HPC platforms and contribute domain-specific models (MatGPT) to materials science research, including a scientifically meaningful downstream task and release of pre-trained models for community use.

Abstract

Large language models (LLMs) have garnered significant attention in both the AI community and beyond. Among these, the Generative Pre-trained Transformer (GPT) has emerged as the dominant architecture, spawning numerous variants. However, these variants have undergone pre-training under diverse conditions, including variations in input data, data preprocessing, and training methodologies, resulting in a lack of controlled comparative studies. Here we meticulously examine two prominent open-sourced GPT architectures, GPT-NeoX and LLaMA, leveraging the computational power of Frontier, the world's first Exascale supercomputer. Employing the same materials science text corpus and a comprehensive end-to-end pipeline, we conduct a comparative analysis of their training and downstream performance. Our efforts culminate in achieving state-of-the-art performance on a challenging materials science benchmark. Furthermore, we investigate the computation and energy efficiency, and propose a computationally efficient method for architecture design. To our knowledge, these pre-trained models represent the largest available for materials science. Our findings provide practical guidance for building LLMs on HPC platforms.

Paper Structure

This paper contains 7 sections, 1 equation, 17 figures, and 5 tables.

Introduction
Related work
Method
Evaluation
Experiment setup
Results
Conclusion

Figures (17)

Figure 1: Evolution of LLM architecture since 2018. Starting from 2021, the GPT architecture dominates the major model releases.
Figure 2: Transformer layer of GPT-NeoX and LLaMA architecture, respectively. The specific parameter and FLOP numbers are for 1.7B parameter model with a sequence length of 2048 and batch size of 16.
Figure 3: A new scientific usage of LLM: combining LLM embeddings with GNN for material properties prediction.
Figure 4: (Left) The heatmap of training throughput (TFLOPS per GPU) for MatGPT with various numbers of layers and hidden sizes for model size around 1B. (Right) The performance boost for architectures eligible for flash attention, including v1 and v2, respectively.
Figure 5: The peak memory usage (percentage) during the training of MatGPT 1.7B with and without flash attention for context sequence length from 2,048 to 32,768.
...and 12 more figures
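
The Figure 2 caption refers to per-layer parameter and FLOP counts for the two architectures. As a rough illustration of where such numbers come from, the sketch below uses the standard textbook weight counts for a GPT-NeoX-style layer (two-matrix MLP with 4x intermediate width) and a LLaMA-style layer (three-matrix SwiGLU MLP). It is not the paper's accounting: the function names are made up for this example, the hidden size 2048 and FFN width 5632 are hypothetical values chosen only for illustration, and biases, layer norms, embeddings, and the attention score matmuls are ignored. Only the sequence length (2048) and batch size (16) come from the caption.

```python
def neox_layer_params(hidden: int) -> int:
    """Weight count of one GPT-NeoX-style layer (biases and norms ignored).

    Attention: Q, K, V and output projections, 4 * h^2.
    MLP: two matrices with intermediate width 4h, 8 * h^2.
    """
    attention = 4 * hidden * hidden
    mlp = 2 * hidden * (4 * hidden)
    return attention + mlp


def llama_layer_params(hidden: int, ffn: int) -> int:
    """Weight count of one LLaMA-style layer (norms ignored).

    Attention: 4 * h^2, as above.
    SwiGLU MLP: gate, up and down projections, 3 * h * ffn.
    """
    attention = 4 * hidden * hidden
    mlp = 3 * hidden * ffn
    return attention + mlp


def forward_matmul_flops(params: int, tokens: int) -> int:
    """Approximate forward FLOPs of the weight matmuls: 2 FLOPs per weight per token."""
    return 2 * params * tokens


if __name__ == "__main__":
    hidden, ffn = 2048, 5632      # hypothetical sizes, not taken from the paper
    tokens = 16 * 2048            # batch size x sequence length from the Figure 2 caption
    for name, params in [
        ("GPT-NeoX-style", neox_layer_params(hidden)),
        ("LLaMA-style", llama_layer_params(hidden, ffn)),
    ]:
        print(name, params, forward_matmul_flops(params, tokens))
```

With an FFN width near 8h/3, the SwiGLU layer lands close to the 12 * h^2 weights of the 4x MLP layer, which is why the two layer types are often compared at matched hidden size.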
