BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models

Yi Fang, Haoran Xu, Jiaxin Han, Sirui Ding, Yizhi Wang, Yue Wang, Xuan Wang

TL;DR

BioArc introduces a principled Neural Architecture Search framework to automatically discover optimized neural architectures for biological foundation models, addressing the mismatch between generic AI architectures and biology-specific data grammars. Through a weight-sharing one-shot supernet, diverse architecture blocks (CNNs, Transformers, Hyena, Mamba, LSTM) and a carefully pruned search space, BioArc identifies hybrid designs (notably Hyena-Transformer-CNN) that excel on DNA and protein tasks, often outperforming larger pretrained models with far fewer parameters. The framework also characterizes how tokenization and training strategies interact with architecture, and it provides several architecture-prediction approaches, including a BioArc Agent that leverages retrieval and reasoning to predict top architectures for new tasks. The work culminates in a foundation-model backbone that matches or exceeds task-specific architectures, with scalable evidence across modalities, suggesting a practical path toward the next generation of biology-focused foundation models.

Abstract

Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying "grammars" inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.

Paper Structure

This paper contains 94 sections, 19 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: This figure provides an overview of the four core stages of the BioArc framework. (1) Search Space Design: Defines diverse module types, network depths, and hidden dimensions, using pruning and clustering strategies to manage the combinatorial explosion. (2) Supernet Construction: Encodes the vast search space into a single, weight-sharing Supernet, with each path representing a candidate architecture. (3) Supernet Training: Adopts a one-shot methodology, sampling and optimizing different paths across batches for efficiency. (4) BioArc Agent: An intelligent agent that analyzes user tasks, leverages a knowledge base to retrieve similar tasks and top architectures, and predicts top architectures from the supernet.
  • Figure 2: This figure illustrates the top five performing DNA model architectures (Arch 1-5), identified by averaging their performance across the different tasks on which they were directly trained from scratch. These architectures are composed of combinations of Hyena, Transformer, and CNN modules, and they share a common pattern. Results on protein are shown in the appendix (architecture pattern).
  • Figure 3: Performance of the BioArc-discovered architecture as a foundation model backbone, denoted BioArc-F. The architecture is selected by the top average performance across all tasks and is pretrained for 1/10 of the baselines' training steps.
  • Figure 4: Performance of different training strategies on DNA. In the left panel, each cell shows the mean performance ± standard deviation across all architectures. In the right panel, each bar shows the percentage of the 360 architectures for which that training strategy yields the best performance. More results on protein can be found in the appendix (training results).
  • Figure 5: Effect of different tokenizations on various architectures trained from scratch on the DNA tasks. More results can be found in the appendix (tokenizers results).
  • ...and 9 more figures
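
Stages (2) and (3) of Figure 1, a weight-sharing supernet trained one path per batch, can be illustrated with a small sketch. This is not the authors' implementation. The block names come from the summary above, while the depth and hidden-size choices, the uniform sampling rule, and every identifier (`sample_path`, `SharedWeights`, `train_supernet`) are assumptions made for illustration. Plain floats stand in for real tensors.

```python
import random

# Hypothetical search space: block names follow the source text, while the
# depth and hidden-size options are illustrative assumptions.
BLOCK_TYPES = ["cnn", "transformer", "hyena", "mamba", "lstm"]
DEPTHS = [2, 4, 6]
HIDDEN_DIMS = [128, 256]


def sample_path(rng):
    """Uniformly sample one candidate architecture (one path in the supernet)."""
    depth = rng.choice(DEPTHS)
    hidden = rng.choice(HIDDEN_DIMS)
    blocks = tuple(rng.choice(BLOCK_TYPES) for _ in range(depth))
    return {"hidden": hidden, "blocks": blocks}


class SharedWeights:
    """Weight store keyed by (layer index, block type, hidden size).

    Two paths that pick the same block at the same layer and width reuse one
    entry, which is the weight-sharing idea.
    """

    def __init__(self):
        self.table = {}

    def get(self, layer, block, hidden):
        key = (layer, block, hidden)
        if key not in self.table:
            self.table[key] = 0.0  # placeholder initialisation
        return key

    def update(self, key, step=0.1):
        self.table[key] += step  # stand-in for a gradient step


def train_supernet(num_batches, seed=0):
    """One-shot loop: each batch optimises only the sampled path's weights."""
    rng = random.Random(seed)
    weights = SharedWeights()
    for _ in range(num_batches):
        path = sample_path(rng)
        for layer, block in enumerate(path["blocks"]):
            key = weights.get(layer, block, path["hidden"])
            weights.update(key)
    return weights


if __name__ == "__main__":
    w = train_supernet(num_batches=200)
    print(len(w.table), "shared weight entries after 200 sampled paths")
```

In this toy space the store can never exceed 6 × 5 × 2 = 60 entries however many paths are sampled, whereas training each candidate separately would need its own weights. That bound is the efficiency argument behind one-shot training.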