A Model Zoo of Vision Transformers

Damian Falk, Léo Meynent, Florence Pfammatter, Konstantin Schürholt, Damian Borth

TL;DR

This paper addresses the lack of transformer-focused model zoos in computer vision by building the first ViT-based model zoo: $250$ models generated via a two-stage training framework, formalized as $(\mathcal{A}_{P}, \mathcal{D}_{P}, \lambda_{P})$ for pre-training and $(\mathcal{A}_{F}, \mathcal{D}_{F}, \lambda_{F})$ for fine-tuning. The authors analyze weight-space and behavioral diversity across a structured hyperparameter grid and find distinct modes tied to pre-training and fine-tuning choices. They demonstrate initial applications: model lineage prediction with MoTHeR, weight averaging through model soups and epoch averaging, and an examination of the limitations of Git Re-Basin. The publicly released zoo provides a resource for scaling population-based methods to state-of-the-art transformer architectures and for studying learning dynamics, representation transfer, and robustness in vision transformers. Overall, the work establishes a benchmark dataset and a methodological platform for transformer-population research in vision, enabling systematic study of how training factors shape large ViT populations.

Abstract

The availability of large, structured populations of neural networks - called 'model zoos' - has led to the development of a multitude of downstream tasks ranging from model analysis, to representation learning on model weights or generative modeling of neural network parameters. However, existing model zoos are limited in size and architecture and neglect the transformer, which is among the currently most successful neural network architectures. We address this gap by introducing the first model zoo of vision transformers (ViT). To better represent recent training approaches, we develop a new blueprint for model zoo generation that encompasses both pre-training and fine-tuning steps, and publish 250 unique models. They are carefully generated with a large span of generating factors, and their diversity is validated using a thorough choice of weight-space and behavioral metrics. To further motivate the utility of our proposed dataset, we suggest multiple possible applications grounded in both extensive exploratory experiments and a number of examples from the existing literature. By extending previous lines of similar work, our model zoo allows researchers to push their model population-based methods from the small model regime to state-of-the-art architectures. We make our model zoo available at github.com/ModelZoos/ViTModelZoo.

Paper Structure

This paper contains 29 sections, 3 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Overview of the model zoo generating factors. On the left, we show the two different training tasks, and multiple pre-training seeds. On the right, we show how each of the pre-trained models is fine-tuned using different configurations forming a hyperparameter grid.
  • Figure 2: Comparison of different weight averaging methods. Averaging over epochs generally increases performance, especially as the generalization gap grows. Averaging over fine-tuned models that differ in exactly one hyperparameter gives mixed results: averaging over fine-tuning seeds improves performance in some cases, whereas averaging over any other hyperparameter harms performance in almost all cases.
  • Figure 3: Longitudinal test accuracy of weight-averaged ViTs on CIFAR-100 over fine-tuning epochs. Models are averaged over the 5 previous training epochs. Averaging over fine-tuning epochs consistently improves performance after the training curve has bent towards its asymptote.
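
The epoch averaging described in Figures 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `average_state_dicts` and `epoch_average` are hypothetical, plain Python lists stand in for weight tensors, and the window is assumed to end at (and include) the given epoch, which the caption does not specify. Uniform averaging of this kind is also the simplest form of a model soup when applied across fine-tuned models instead of epochs.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def average_state_dicts(state_dicts: Sequence[StateDict]) -> StateDict:
    """Uniformly average checkpoints that share parameter names and sizes."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    keys = state_dicts[0].keys()
    for sd in state_dicts:
        if sd.keys() != keys:
            raise ValueError("state dicts must share parameter names")
    n = len(state_dicts)
    averaged: StateDict = {}
    for name in keys:
        size = len(state_dicts[0][name])
        if any(len(sd[name]) != size for sd in state_dicts):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise mean over all checkpoints.
        averaged[name] = [
            sum(sd[name][i] for sd in state_dicts) / n for i in range(size)
        ]
    return averaged


def epoch_average(
    checkpoints: Sequence[StateDict], epoch: int, window: int = 5
) -> StateDict:
    """Average the `window` checkpoints ending at `epoch` (inclusive).

    Assumption: checkpoints[i] holds the weights after epoch i. Fewer than
    `window` checkpoints are used when `epoch` is near the start of training.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if not 0 <= epoch < len(checkpoints):
        raise IndexError("epoch out of range")
    start = max(0, epoch - window + 1)
    return average_state_dicts(checkpoints[start : epoch + 1])


if __name__ == "__main__":
    # Toy checkpoints whose weights grow linearly with the epoch index.
    toy = [{"w": [float(e), 2.0 * e]} for e in range(10)]
    print(epoch_average(toy, epoch=9, window=5))
```

With real ViT checkpoints the same logic would operate on tensors rather than lists, and the averaged weights would be loaded back into the model before evaluation.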