Are Protein Language Models Compute Optimal?

Yaiza Serrano, Álvaro Ciudad, Alexis Molina

TL;DR

The paper investigates whether protein language models (pLMs) follow compute-optimal scaling laws like NLP models under a fixed FLOPs budget. By adapting the Kaplan et al. and Hoffmann et al. frameworks to encoder-only pLMs and using UniRef50 subsets, it derives how the optimal model size $N_{\text{opt}}$ and optimal token count $D_{\text{opt}}$ scale with compute $C$, finding $N_{\text{opt}} \propto C^{0.27}$ and $D_{\text{opt}} \propto C^{0.71}$ and revealing a training-loss plateau near $L \approx 2.4$ that is largely insensitive to $N$ and $D$. Notably, a 35M-parameter model trained on a reduced dataset in a single pass achieves a perplexity of about $11.43$, closely matching or outperforming some much larger models when compute is limited, highlighting the potential for compute-efficient pLMs. The work argues that NLP scaling laws do not directly apply to pLMs and advocates for compute-aware model design to democratize access and accelerate applications in computational biology, while outlining directions for future exploration with larger models, multi-pass training, and downstream evaluation.
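
To make the reported exponents concrete, the following Python sketch applies them to a hypothetical compute multiplier. It assumes pure power laws with the exponents quoted above and does not use the fitted prefactors, which are not given here, so it reports only relative growth. It also assumes the training loss is a cross-entropy in nats, so that perplexity equals $\exp(L)$. The function names are illustrative and do not come from the paper.

```python
import math

# Exponents quoted in the TL;DR: N_opt ∝ C^0.27, D_opt ∝ C^0.71.
N_EXPONENT = 0.27
D_EXPONENT = 0.71


def growth_factors(compute_multiplier: float) -> tuple[float, float]:
    """Relative growth of optimal model size and token count.

    Assumes pure power laws; the prefactors cancel because only the
    ratio between two compute budgets is considered.
    """
    n_growth = compute_multiplier ** N_EXPONENT
    d_growth = compute_multiplier ** D_EXPONENT
    return n_growth, d_growth


def perplexity_from_loss(loss_nats: float) -> float:
    """Perplexity for a cross-entropy loss measured in nats (assumption)."""
    return math.exp(loss_nats)


# Hypothetical scenario: the compute budget increases tenfold.
n_growth, d_growth = growth_factors(10.0)
print(f"model size grows by  x{n_growth:.2f}")
print(f"token count grows by x{d_growth:.2f}")

# Perplexity implied by the reported loss plateau of about 2.4.
print(f"perplexity at L = 2.4: {perplexity_from_loss(2.4):.2f}")
```

Under these assumptions, a tenfold compute increase grows the optimal model size by roughly 1.9x and the optimal token count by roughly 5.1x, so most additional compute goes to data rather than parameters. A loss of 2.4 corresponds to a perplexity of about 11.0, in line with the reported value of about 11.43.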

Abstract

While protein language models (pLMs) have transformed biological research, the scaling laws governing their improvement remain underexplored. By adapting methodologies from NLP scaling laws, we investigated the optimal ratio between model parameters and training tokens within a fixed compute budget. Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases, and we identify a performance plateau in training loss comparable to the one found in relevant works in the field. Our findings suggest that widely-used pLMs might not be compute-optimal, indicating that larger models could achieve convergence more efficiently. Training a 35M model on a reduced token set, we attained perplexity results comparable to larger models like ESM-2 (15B) and xTrimoPGLM (100B) with a single dataset pass. This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.

Paper Structure (15 sections, 6 equations, 3 figures, 6 tables)