Zero-shot generalization across architectures for visual classification

Evan Gerritz; Luciano Dyballa; Steven W. Zucker

TL;DR

The paper tackles zero-shot generalization to unseen visual classes within a minimalist domain of Chinese calligraphy, highlighting that high classification accuracy does not imply strong generalization. It introduces an embedding-based zero-shot framework and defines a generalization index, $g = \max_i \left\{ \mathrm{NMI}\left( \mathcal{C}^{i}_{\mathrm{unseen}}, \mathcal{C}^{\star} \right) \right\}$, which quantifies how well unseen classes cluster in intermediate representations. Experiments across multiple architectures, including Vision Transformers and CNNs, reveal that generalization varies substantially across models and varies non-monotonically with layer depth, with accuracy and generalization only loosely coupled. The work provides a framework for measuring representation robustness and motivates future generalization-driven objectives beyond the standard classification loss.
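The index $g$ can be illustrated with a small sketch. It assumes that $i$ ranges over layers, that $\mathcal{C}^{i}_{\mathrm{unseen}}$ is a list of cluster assignments for the unseen-class examples at layer $i$ (however those clusters were obtained), and that $\mathcal{C}^{\star}$ is the ground-truth labeling. The function names `nmi` and `generalization_index` are hypothetical, and the arithmetic-mean normalization of NMI is an assumption; the summary does not state which normalization the authors use.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same items.

    Assumption: MI is normalized by the arithmetic mean of the two entropies.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    # Entropy of each labeling.
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    # Mutual information from the joint (contingency) counts.
    mi = sum(
        c / n * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
    mean_entropy = (h_a + h_b) / 2
    if mean_entropy == 0:
        return 1.0  # both labelings put every item in a single group
    return mi / mean_entropy


def generalization_index(layer_clusterings, ground_truth):
    """Return (g, best_layer): the maximum per-layer NMI and the layer attaining it."""
    scores = [nmi(clusters, ground_truth) for clusters in layer_clusterings]
    best_layer = max(range(len(scores)), key=scores.__getitem__)
    return scores[best_layer], best_layer


# Toy example: three "layers" clustering six unseen-class examples.
truth = [0, 0, 1, 1, 2, 2]
layers = [
    [0, 1, 0, 1, 0, 1],  # unrelated to the true classes
    [5, 5, 7, 7, 9, 9],  # matches the true classes up to relabeling
    [0, 0, 1, 1, 1, 1],  # merges two of the true classes
]
g, best_layer = generalization_index(layers, truth)
```

Because NMI is invariant to how clusters are named, the second toy layer scores as a perfect match even though its cluster ids differ from the ground-truth labels.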

Abstract

Generalization to unseen data is a key desideratum for deep networks, but its relation to classification accuracy is unclear. Using a minimalist vision dataset and a measure of generalizability, we show that popular networks, from deep convolutional networks (CNNs) to transformers, vary in their power to extrapolate to unseen classes both across layers and across architectures. Accuracy is not a good predictor of generalizability, and generalization varies non-monotonically with layer depth.

Paper Structure (6 sections, 5 figures, 1 table)

Introduction
Methods & results
Conclusion
Appendix
Alternative generalization metric
CIFAR-100 dataset

Figures (5)

Figure 1: Generalizability and accuracy are loosely coupled. (a) An overview of our method of assessing generalizability through out-of-sample embeddings using intermediate layers, visualized using PCA. Embeddings from different hidden states of the Vision Transformer (ViT) produce widely varying results. Color labels indicate ground truth: clustered unseen classes indicate better generalization ($g$). (b) Results for ViT (top) and ConvNeXtV2 (bottom). Across epochs, test-set accuracy monotonically increases while generalizability may plateau or even decrease (left). Across layers, there are no predictable trends (right). For additional results, see the Appendix.
Figure 2: Subset of the dataset, consisting of 128 individual characters drawn by 20 different calligraphers. We invite the reader to identify characters that appear to have been written by the same person, and to notice the nuances of this task. Fine-tuned networks can easily perform this task with high accuracy for artists on which they have been trained (Table 1).
Figure 3: Embedding of 10 calligraphers (including in-sample and out-of-sample) obtained via ResNet and visualized using PCA, with representative character images for each of the ground-truth clusters shown. The images were chosen by computing the centroid of each cluster and selecting those corresponding to the 10 nearest points.
Figure 4: Generalization across fine-tuning epochs for all studied networks. We plot $g_{\mathrm{seen}}$ and test-set accuracy (using seen classes) for reference. Notice how $g_{\mathrm{seen}}$ is higher than $g$ for all networks, as expected. This demonstrates that our $g$ metric is able to capture the ability of the networks to generalize to unseen classes.
Figure 5: The generalizability trend across layers may vary considerably from model to model. Although deeper layers tend to better separate the examples from unseen classes, $g_i$ does not increase monotonically with depth, and for most models (ResNet, ViT, PViT, and PoolFormer) the best generalization is found at intermediate layers. This holds true for both datasets used for fine-tuning (calligraphy or CIFAR), showing that this phenomenon does not seem dataset-specific. Moreover, similar trends were found regardless of the metric used (NMI or $k$-nearest neighbors), as demonstrated by qualitatively similar curve shapes, even though the two metrics are based on completely different methods. In almost all cases, the two metrics identified the same layer as the one that best generalized to the unseen classes. Interestingly, the $g_i$ curves are also qualitatively similar across datasets, indicating that the patterns observed are due to the network's architecture, and not the dataset.
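The selection rule in the Figure 3 caption (compute each cluster's centroid, then take the 10 nearest points) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `nearest_to_centroid` is hypothetical, points are plain coordinate tuples, and Euclidean distance is assumed since the caption does not name a distance.

```python
import math


def nearest_to_centroid(points, k=10):
    """Return indices of the k points closest to the centroid of `points`.

    Assumption: closeness is measured by Euclidean distance.
    """
    if not points:
        raise ValueError("points must be non-empty")
    dim = len(points[0])
    # Centroid is the coordinate-wise mean of the cluster's points.
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Sort point indices by distance to the centroid, nearest first.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], centroid))
    return order[:k]


# Toy 2-D cluster: four corner points plus two points near the middle.
cluster = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5), (4, 6)]
representatives = nearest_to_centroid(cluster, k=2)
```

Applied once per ground-truth cluster, the returned indices would identify the images to display as that cluster's representatives.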
