A Model Zoo of Vision Transformers

Damian Falk; Léo Meynent; Florence Pfammatter; Konstantin Schürholt; Damian Borth

TL;DR

This paper addresses the lack of transformer-focused model zoos in computer vision by building the first ViT-based model zoo: $250$ models generated with a two-stage training framework, formalized as $(\mathcal{A}_{P}, \mathcal{D}_{P}, \lambda_{P})$ for pre-training and $(\mathcal{A}_{F}, \mathcal{D}_{F}, \lambda_{F})$ for fine-tuning. The authors analyze weight-space and behavioral diversity across a structured hyperparameter grid and find distinct modes tied to pre-training and fine-tuning choices. They demonstrate initial applications, including model lineage prediction with MoTHeR and weight averaging through model soups and epoch averaging, and they show the limitations of Git Re-Basin. The publicly released zoo lets researchers scale population-based methods to state-of-the-art transformer architectures and study learning dynamics, representation transfer, and robustness in vision transformers. Overall, the work provides a benchmark dataset and a methodological platform for transformer-population research in vision, enabling systematic study of how individual training factors shape large ViT populations.
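To make the weight-averaging application concrete, the following is a minimal sketch of a uniform model soup: an elementwise mean over the parameters of several models that share one architecture. It is an illustration under stated assumptions, not the authors' implementation. The `StateDict` type (parameter name mapped to a flat list of floats) and the function name `uniform_soup` are hypothetical stand-ins for real checkpoint tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flattened values.
StateDict = Dict[str, List[float]]


def uniform_soup(models: List[StateDict]) -> StateDict:
    """Return the elementwise mean of the parameters of all given models."""
    if not models:
        raise ValueError("need at least one model")

    keys = models[0].keys()
    for model in models[1:]:
        if model.keys() != keys:
            raise ValueError("all models must share the same parameter names")

    soup: StateDict = {}
    for key in keys:
        length = len(models[0][key])
        if any(len(model[key]) != length for model in models):
            raise ValueError(f"parameter {key!r} differs in size across models")
        # zip(*...) groups the i-th value of this parameter from every model.
        columns = zip(*(model[key] for model in models))
        soup[key] = [sum(values) / len(models) for values in columns]
    return soup


# Two toy "models" with a single two-element parameter each.
model_a = {"w": [1.0, 3.0]}
model_b = {"w": [3.0, 5.0]}
averaged = uniform_soup([model_a, model_b])
```

Applying the same function to checkpoints saved at different epochs of a single training run corresponds to epoch averaging. Averaging only makes sense when the models have identical parameter names and shapes, which is why the sketch checks both before combining them.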

Abstract

The availability of large, structured populations of neural networks, called 'model zoos', has led to the development of a multitude of downstream tasks, ranging from model analysis to representation learning on model weights and generative modeling of neural network parameters. However, existing model zoos are limited in size and architecture and neglect the transformer, which is currently among the most successful neural network architectures. We address this gap by introducing the first model zoo of vision transformers (ViT). To better represent recent training approaches, we develop a new blueprint for model zoo generation that encompasses both pre-training and fine-tuning steps, and publish 250 unique models. They are carefully generated with a large span of generating factors, and their diversity is validated using a thorough choice of weight-space and behavioral metrics. To further motivate the utility of our proposed dataset, we suggest multiple possible applications grounded in both extensive exploratory experiments and a number of examples from the existing literature. By extending previous lines of similar work, our model zoo allows researchers to push their model population-based methods from the small model regime to state-of-the-art architectures. We make our model zoo available at github.com/ModelZoos/ViTModelZoo.
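The two-stage blueprint can be pictured as a grid in which every pre-training configuration branches into several fine-tuning configurations. The sketch below enumerates such a grid. It assumes that each stage is described by an architecture, a dataset, and hyperparameters, mirroring the $(\mathcal{A}, \mathcal{D}, \lambda)$ notation of the TL;DR. All class names, field names, and values are hypothetical placeholders and do not reproduce the grid used for the released zoo.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class StageConfig:
    """One training stage: architecture, dataset, and hyperparameters."""
    architecture: str
    dataset: str
    learning_rate: float  # hypothetical example of a varied hyperparameter
    seed: int             # hypothetical example of a varied hyperparameter


@dataclass(frozen=True)
class ZooEntry:
    """One model in the zoo, identified by its lineage through both stages."""
    pretrain: StageConfig
    finetune: StageConfig


def build_zoo_grid(
    architecture: str,
    pretrain_dataset: str,
    finetune_dataset: str,
    pretrain_lrs: Sequence[float],
    pretrain_seeds: Sequence[int],
    finetune_lrs: Sequence[float],
    finetune_seeds: Sequence[int],
) -> List[ZooEntry]:
    """Cross every pre-training config with every fine-tuning config."""
    pretrain_configs = [
        StageConfig(architecture, pretrain_dataset, lr, seed)
        for lr, seed in product(pretrain_lrs, pretrain_seeds)
    ]
    entries: List[ZooEntry] = []
    for pre in pretrain_configs:
        for lr, seed in product(finetune_lrs, finetune_seeds):
            fine = StageConfig(architecture, finetune_dataset, lr, seed)
            entries.append(ZooEntry(pretrain=pre, finetune=fine))
    return entries


# Placeholder values only: 2 x 2 pre-training configs, each with 3 x 1 fine-tuning configs.
grid = build_zoo_grid(
    architecture="vit-placeholder",
    pretrain_dataset="pretrain-data-placeholder",
    finetune_dataset="finetune-data-placeholder",
    pretrain_lrs=[1e-3, 1e-4],
    pretrain_seeds=[0, 1],
    finetune_lrs=[1e-3, 1e-4, 1e-5],
    finetune_seeds=[0],
)
```

Storing the pre-training configuration inside each entry keeps the lineage of every fine-tuned model explicit, which is the kind of structure that lineage-prediction and diversity analyses rely on.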
