ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

Athanasios Angelakis

TL;DR

This paper introduces ZACH-ViT, a compact Vision Transformer that achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Its findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance.

Abstract

Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge-deployed clinical systems. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term "Zero-token" specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi-class tasks under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Code and models are available at https://github.com/Bluesman79/ZACH-ViT.
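
The permutation-invariance claim in the abstract can be checked with a small numerical sketch. The code below is not the ZACH-ViT implementation: it assumes a single-head, single-layer attention block with random weights, omits the adaptive residual projections, and uses hypothetical names (`encode`, `self_attention`) chosen for illustration. It shows only that self-attention without positional embeddings, followed by global average pooling over patch tokens, yields the same output when the patches are shuffled, and that adding a fixed positional embedding breaks this property.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product attention over patch tokens.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def encode(tokens, params):
    # Illustrative encoder: no [CLS] token and no positional embedding.
    # A residual attention block is followed by global average pooling.
    h = tokens + self_attention(tokens, *params)
    return h.mean(axis=0)


rng = np.random.default_rng(0)
num_patches, dim = 16, 8  # assumed sizes, for illustration only
patches = rng.normal(size=(num_patches, dim))
params = tuple(rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
perm = rng.permutation(num_patches)

# Without positional embeddings, shuffling the patches leaves the pooled
# representation unchanged (up to floating-point rounding).
out_original = encode(patches, params)
out_shuffled = encode(patches[perm], params)
print(np.allclose(out_original, out_shuffled))  # True

# With a fixed positional embedding added per slot, the same shuffle
# generally changes the pooled representation.
pos = rng.normal(size=(num_patches, dim))
out_pos_original = encode(patches + pos, params)
out_pos_shuffled = encode(patches[perm] + pos, params)
print(np.allclose(out_pos_original, out_pos_shuffled))
```

The first check holds because attention without positional information is permutation-equivariant and the mean over tokens discards order. The second check illustrates the fixed spatial prior that a positional embedding introduces.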

Paper Structure (23 sections, 3 equations, 11 figures, 3 tables)

Introduction
Related Work
Permutation-Invariant Vision Architectures
Distinction from MIL-based Transformers.
Positional Embedding Removal Across Domains
Method: ZACH-ViT
Architecture Overview
Adaptive Residual Projections
Experimental Protocol
Few-Shot Setting
Spatial Structure Spectrum and Dataset Selection
Model Families and Benchmarking
Implementation and Experimental Environment
Evaluation Metrics
Results on Medical Imaging Benchmarks
...and 8 more sections

Figures (11)

Figure 1: Regime spectrum analysis. Each point shows ZACH-ViT's advantage over the mean performance of other scratch-trained baselines on a dataset. Datasets are ordered by an ordinal spatial-structure strength index (1 = very weak, 5 = strong). The trend suggests that ZACH-ViT's relative advantage is larger when spatial layout is weakly informative and smaller when anatomical structure is fixed.
Figure 2: Global parameter efficiency across all evaluated models. ZACH-ViT competes with substantially larger pretrained architectures despite scratch training.
Figure 3: Parameter efficiency across individual MedMNIST datasets. Each subplot reports model performance versus parameter count within a specific spatial-structure regime. The star indicates ZACH-ViT. Results illustrate regime-dependent efficiency: ZACH-ViT shows strong competitiveness in weak-structure datasets, while pretrained models generally achieve higher performance in strongly structured anatomical regimes.
Figure 4: Generalization gap (Train–Test) for scratch models. ZACH-ViT exhibits consistently small gaps across datasets.
Figure 5: Inference time versus performance. ZACH-ViT occupies an efficient region of the accuracy–latency trade-off.
...and 6 more figures
