Comparative Study of Large Language Model Architectures on Frontier
Junqi Yin, Avishek Bose, Guojing Cong, Isaac Lyngaas, Quentin Anthony
TL;DR
This work addresses the lack of controlled, architecture-level comparisons of open-source GPT variants in scientific domains by pre-training and evaluating materials-science foundation models (MatGPT) on Frontier. It conducts an end-to-end comparison of the GPT-NeoX and LLaMA architectures on a domain-specific corpus of 15B tokens, exploring tokenization, architecture search, and HPC optimizations (including flash attention and the LAMB optimizer) on AMD GPUs. The study demonstrates near-parity between the two architectures on general benchmarks, while showing substantial gains in energy efficiency and scalability on HPC hardware. Importantly, embedding-derived features from MatGPT combined with GNNs achieve state-of-the-art performance on a band-gap prediction task in materials science. The results provide practical guidance for deploying LLMs on HPC platforms and contribute domain-specific pre-trained models (MatGPT), released for community use, to materials science research.
Abstract
Large language models (LLMs) have garnered significant attention in both the AI community and beyond. Among these, the Generative Pre-trained Transformer (GPT) has emerged as the dominant architecture, spawning numerous variants. However, these variants have been pre-trained under diverse conditions, including variations in input data, data preprocessing, and training methodologies, resulting in a lack of controlled comparative studies. Here we meticulously examine two prominent open-source GPT architectures, GPT-NeoX and LLaMA, leveraging the computational power of Frontier, the world's first exascale supercomputer. Employing the same materials science text corpus and a comprehensive end-to-end pipeline, we conduct a comparative analysis of their training and downstream performance. Our efforts culminate in state-of-the-art performance on a challenging materials science benchmark. Furthermore, we investigate computational and energy efficiency, and propose a computationally efficient method for architecture design. To our knowledge, these pre-trained models are the largest available for materials science. Our findings provide practical guidance for building LLMs on HPC platforms.
