Vision Transformer Neural Architecture Search for Out-of-Distribution Generalization: Benchmark and Insights

Sy-Tuyen Ho; Tuan Van Vo; Somayeh Ebrahimkhani; Ngai-Man Cheung

TL;DR

This work introduces OoD-ViT-NAS, the first benchmark and analysis framework assessing how Vision Transformer architectures impact Out-of-Distribution generalization. By sampling 3,000 ViT architectures from the Autoformer search space via One-Shot NAS and evaluating them on 8 OoD datasets, the study reveals that architectural design substantially influences OoD performance and that ID accuracy is a poor predictor of OoD success. It also shows that nine Training-free NAS proxies largely fail to predict OoD accuracy, with simple metrics like parameter count and FLOPs offering stronger signals. A key finding is that embedding dimension is the most influential ViT attribute for OoD generalization, enabling robust architectures that can outperform some SOTA OoD methods, thereby guiding future ViT design for real-world, distribution-shifted tasks.

Abstract

While ViTs have achieved success across machine learning tasks, deploying them in real-world scenarios faces a critical challenge: generalizing under OoD shifts. A crucial research gap exists in understanding how to design ViT architectures, both manually and automatically, for better OoD generalization. To this end, we introduce OoD-ViT-NAS, the first systematic benchmark for ViT NAS focused on OoD generalization. This benchmark includes 3,000 ViT architectures of varying computational budgets evaluated on 8 common OoD datasets. Using this benchmark, we analyze factors contributing to OoD generalization. Our findings reveal key insights. First, ViT architecture designs significantly affect OoD generalization. Second, ID accuracy is often a poor indicator of OoD accuracy, highlighting the risk of optimizing ViT architectures solely for ID performance. Third, we perform the first study of NAS for ViTs' OoD robustness, analyzing 9 Training-free NAS methods. We find that existing Training-free NAS methods are largely ineffective in predicting OoD accuracy despite excelling at predicting ID accuracy. Simple proxies such as #Param or #Flop surprisingly outperform complex Training-free NAS methods in predicting OoD accuracy. Finally, we study how ViT architectural attributes impact OoD generalization and discover that increasing the embedding dimension generally enhances OoD performance. Our benchmark shows that ViT architectures exhibit a wide range of OoD accuracy, with up to 11.85% improvement for some OoD shifts. This underscores the importance of studying ViT architecture design for OoD. We believe OoD-ViT-NAS can catalyze further research into how ViT designs influence OoD generalization.
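The abstract's claims about predictors of OoD accuracy rest on rank correlation: the benchmark's analysis reports the Kendall $\tau$ ranking correlation between OoD accuracy and quantities such as ID accuracy or parameter count. The following is a minimal sketch of that kind of comparison, not the authors' evaluation code. The function name `kendall_tau`, the variable names, and all numeric values are hypothetical illustrations and are not taken from the benchmark. It uses the simple tau-a variant, in which tied pairs count as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs contribute to neither count (assumption: tau-a variant).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy values for five architectures (NOT benchmark results).
id_acc = [72.1, 73.4, 74.0, 74.8, 75.2]    # in-distribution accuracy (%)
ood_acc = [40.3, 38.9, 42.5, 41.0, 43.7]   # accuracy under one OoD shift (%)
params_m = [5.4, 6.1, 7.0, 8.2, 9.5]       # parameter count (millions)

tau_id = kendall_tau(id_acc, ood_acc)
tau_params = kendall_tau(params_m, ood_acc)
print(f"tau(ID acc, OoD acc)  = {tau_id:.2f}")
print(f"tau(#Param, OoD acc)  = {tau_params:.2f}")
```

A value near 1 means the quantity ranks architectures almost the same way OoD accuracy does; a value near 0 means it carries little ranking information. For large numbers of architectures, a library routine with tie handling would be a more practical choice than this quadratic loop.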

Paper Structure (32 sections, 32 figures, 15 tables)

This paper contains 32 sections, 32 figures, 15 tables.

Introduction
Related Work
OoD-ViT-NAS: NAS Benchmark for ViT's OoD Generalization
Investigation on Out-of-Distribution Generalization of ViT
ViT architecture designs have a considerable impact on OoD generalization
Can ID accuracy serve as a good indication for OoD accuracy?
Explore Training-free NAS for OoD Generalization
ViT Structural Attributes on OoD Generalization: Increasing Embedding Dimension is Generally Helpful
Experimental Setup.
Results.
Robust ViT architectures designed by our finding.
Conclusion
Appendix Overview
Limitations and Broader Impact
Analysis on Human-design ViT Search Space
...and 17 more sections

Figures (32)

Figure 1: We propose OoD-ViT-NAS, the first comprehensive benchmark for NAS on OoD generalization of ViT architectures. We then comprehensively investigate OoD generalization for ViT. Details of the $8$ OoD datasets in our investigation can be found in Tab. \ref{tab:benchmark_summary}. In this figure, we show the Kendall $\tau$ ranking correlation between the OoD accuracy on different datasets (left) and different quantities (bottom). Our analysis uncovers several key insights. (a) ID as an indicator for ViT OoD Generalization (Sec. \ref{Sec:IDvsOoDAcc}). We show that the correlation between ID accuracy and OoD accuracy is not very high. This suggests that current architectural insights based on ID accuracy might not translate well to OoD generalization. (b) Training-free NAS for ViT OoD Generalization (Sec. \ref{Sec:Proxies}). We conduct the first study of Training-free NAS for ViT's OoD generalization, showing that the effectiveness of these methods weakens significantly when predicting OoD accuracy. (c) ViT Architectural Attributes for OoD Generalization (Sec. \ref{Sec:Robust_Arch}). Our first study on the impact of ViT architectural attributes on OoD generalization shows that the embedding dimension generally has the highest correlation with OoD accuracy among ViT architectural attributes. Additional results can be found in the Appx.
Figure 2: Our analysis of the OoD accuracy range highlights the significant influence of ViT architectural designs on OoD accuracy (Sec. \ref{sec:ood_accuracy_range}). The numbers within each violin plot for each sub-figure (e.g., IN-D $9.79$ ($1.06$), $9.65$ ($2.25$), and $7.99$ ($0.56$)) denote the corresponding OoD (ID) accuracy range of architectures sampled from the Autoformer-Tiny/Small/Base search spaces, respectively. See Appx. \ref{Sec:Appx_OoDrange} for additional plots and results on other OoD shifts. For a fair comparison, we use the same x-axis range across all sub-figures. We include the ID accuracy range in the top-left sub-figure for reference. On average, the OoD accuracy range across all shifts is $3.8\%$/$4.86\%$/$2.74\%$ for the three search spaces in our OoD-ViT-NAS benchmark. This range is comparable to, and even exceeds, the $1.9\%$ improvement in OoD accuracy achieved under similar settings by the current SOTA method based on domain-invariant representation learning \cite{bai2024hypo}.
Figure 3: Visualization of the OoD accuracy range across OoD shift severity. We conduct the analysis on $1,000$ architectures in the Autoformer-Small search space within our OoD-ViT-NAS benchmark. Level $0$ denotes the clean examples. All corruptions can be found in Fig. \ref{fig:accuracy_severity-range_rest} in Appx. \ref{Sec:Appx_OoDrange}. We generally observe that the range of OoD accuracy widens as the severity of the OoD shift increases.
Figure 4: Analysis of OoD Generalization Performance of Pareto Architectures for ID accuracy. Blue dots represent architectures in the search space, while red dots represent the ID Pareto architectures. See Appx. \ref{Sec:Appx_Pareto} for additional results. We find that Pareto architectures for ID accuracy generally perform sub-optimally under OoD shift.
Figure 5: The effect of #Embed_Dim on the OoD generalization of ViTs. The numbers denote the mean OoD accuracy across ViT architectures with specific colour-coded embedding dimensions and depths. The blue, orange, and green data points represent ViT architectures with an embedding dimension of $320$, $384$, and $448$, respectively. For most OoD shifts, OoD accuracy generally increases as the embedding dimension of the ViT architecture increases. See Fig. \ref{fig:embedding-all-small1} and \ref{fig:embedding-all-small2} in Appx. \ref{Sec:Appx_ViTAttribute_Embedding} for additional plots and results on other OoD shifts.
...and 27 more figures
