Specialized Foundation Models Struggle to Beat Supervised Baselines

Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen, Ameet Talwalkar, Mikhail Khodak

TL;DR

This work asks whether specialized foundation models (FMs) pretrained on vast data actually outperform traditional supervised approaches in genomics, satellite imaging, and time series. The authors introduce two automated pipelines (DASHA for CNN-based architecture search and Auto-AR for tuning autoregressive forecasters) that build strong, domain-specific supervised baselines using only target-task data. Across more than fifty tasks, simple, well-tuned CNNs and linear autoregressive models often match or exceed the performance of open-source FMs, sometimes by substantial margins, while also being far cheaper to compute. The findings suggest that the transfer benefits of large-scale pretraining have not yet been realized in these specialized domains. They also underscore the importance of robust baselines, and the pipelines are released as open-source tools for fair FM evaluation and benchmarking.

Abstract

Following its success for vision and text, the "foundation model" (FM) paradigm -- pretraining large models on massive data, then fine-tuning on target tasks -- has rapidly expanded to domains in the sciences, engineering, healthcare, and beyond. Has this achieved what the original FMs accomplished, i.e. the supplanting of traditional supervised learning in their domains? To answer we look at three modalities -- genomics, satellite imaging, and time series -- with multiple recent FMs and compare them to a standard supervised learning workflow: model development, hyperparameter tuning, and training, all using only data from the target task. Across these three specialized domains, we find that it is consistently possible to train simple supervised models -- no more complicated than a lightly modified wide ResNet or UNet -- that match or even outperform the latest foundation models. Our work demonstrates that the benefits of large-scale pretraining have yet to be realized in many specialized areas, reinforces the need to compare new FMs to strong, well-tuned baselines, and introduces two new, easy-to-use, open-source, and automated workflows for doing so.
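To make the "standard supervised learning workflow" concrete for the time series case, the following is a minimal sketch of a linear autoregressive forecaster whose lookback window is tuned on held-out validation data. It is not the authors' Auto-AR pipeline: the function names (`make_windows`, `fit_ar`, `val_mse`, `tune_lookback`), the candidate lookbacks, and the synthetic sine-wave data are all assumptions chosen for illustration, and the search is a plain grid search using only target-task data.

```python
import numpy as np

def make_windows(series, lookback):
    """Build lagged inputs X of shape (n, lookback) and next-step targets y of shape (n,)."""
    X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y

def fit_ar(series, lookback):
    """Least-squares fit of a linear AR model with an intercept (hypothetical helper)."""
    X, y = make_windows(series, lookback)
    A = np.hstack([X, np.ones((len(X), 1))])  # last column is the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def val_mse(coef, series, lookback):
    """Mean squared one-step-ahead error of a fitted model on a series."""
    X, y = make_windows(series, lookback)
    A = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((A @ coef - y) ** 2))

def tune_lookback(train, val, candidates):
    """Grid search: keep the lookback with the lowest validation MSE."""
    scores = {}
    for p in candidates:
        coef = fit_ar(train, p)
        # Prepend the last p training points so every validation target has full history.
        scores[p] = val_mse(coef, np.concatenate([train[-p:], val]), p)
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic data (an assumption for illustration): a noisy sine wave with period 24.
rng = np.random.default_rng(0)
t = np.arange(600)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(600)
train, val = series[:400], series[400:]

best_p, scores = tune_lookback(train, val, candidates=[2, 8, 24, 48])
print("best lookback:", best_p)
```

The whole search here costs only a handful of small least-squares solves, which is the kind of cheap baseline the abstract argues new FMs should be compared against.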

Paper Structure

This paper contains 42 sections, 8 figures, 21 tables, 1 algorithm.

Figures (8)

  • Figure 1: Across genomics, satellite imaging, and time series, specialized FMs fail to significantly improve upon supervised learning despite using two-to-five orders of magnitude more data. In contrast, breakthrough FMs such as BERT dramatically outperformed supervised NLP baselines (top left), causing the field to switch to fine-tuning as the default approach. For each domain we plot total pretraining and fine-tuning data vs. mean improvement over the supervised state-of-the-art. The NLP results are from the GLUE benchmark \cite{wang2019glue} while evaluations of the last three domains are in Section \ref{sec:empirical}. Note that for the NLP x-axis we ignore word embedding pretraining tokens.
  • Figure 2: Our goal is to compare the pretrain-then-fine-tune paradigm (top) with a standard supervised workflow (bottom) on the tasks on which specialized FMs are evaluated. While for time series we go through a traditional process of developing and tuning a supervised model, this manual approach does not scale to many domains; as a result, in Section \ref{sec:dasha} we develop a way to simulate it using architecture search. Note that FM fine-tuning hyperparameters are not always tuned in practice, but we assume their creators make a best-effort attempt to present their own method in the best light.
  • Figure 3: Performance of the best FMs when given only 20% as much fine-tuning data (cf. Table \ref{tab:subsampling}). Supervised baselines are competitive even in this data-scarce regime: in satellite imaging and time series they are outperformed by just one FM, while in genomics they are beaten by two. Notably, the worst of the three genomics FMs in the full setting does best with less data.
  • Figure 4: Tuning costs (excluding FM pretraining but including HPO for all methods and NAS for DASHA) and downstream fine-tuning / re-training costs (cf. Table \ref{tab:compute}). Apart from Caduceus, supervised models are much cheaper, and for time series the entire Auto-AR pipeline is $3.5\times$ faster than fine-tuning any FM once (n.b. tuning costs for time series FMs are unknown).
  • Figure 5: PCA visualization of architectures discovered for three different tasks when DASHA is run multiple times. Clustering across tasks reveals the within-task consistency of the architecture search component and the utility of diverse models as baselines.
  • ...and 3 more figures