Zero-shot generalization across architectures for visual classification
Evan Gerritz, Luciano Dyballa, Steven W. Zucker
TL;DR
The paper tackles zero-shot generalization to unseen visual classes in a minimalist domain, Chinese calligraphy, and shows that high classification accuracy does not imply strong generalization. It introduces an embedding-based zero-shot framework and defines a generalization index $g = \max_i \left\{ \mathrm{NMI}\left( \mathcal{C}^{i}_{\mathrm{unseen}}, \mathcal{C}^{\star} \right) \right\}$, where NMI is the normalized mutual information, to quantify how well unseen classes cluster in intermediate representations. Experiments across multiple architectures, including Vision Transformers and CNNs, reveal that generalization varies substantially across models and varies non-monotonically with layer depth, with accuracy and generalization only loosely coupled. The work provides a framework for measuring representation robustness and motivates future generalization-driven objectives beyond the standard classification loss.
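As a rough illustration of the index $g$, the sketch below computes NMI between each candidate clustering of the unseen-class samples and the ground-truth labels, then takes the maximum. It is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not say what the index $i$ ranges over or how NMI is normalized, so the code assumes $i$ indexes a list of candidate clusterings and normalizes by the arithmetic mean of the two entropies. The function names `entropy`, `nmi`, and `generalization_index` are hypothetical, and the clustering step that produces the candidate labelings from embeddings is assumed to happen elsewhere.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(pred, truth):
    """Normalized mutual information between two labelings.

    Assumption: normalized by the arithmetic mean of the two entropies;
    the normalization used in the paper is not stated in this summary.
    """
    if len(pred) != len(truth):
        raise ValueError("labelings must have the same length")
    n = len(pred)
    joint = Counter(zip(pred, truth))
    pred_counts = Counter(pred)
    truth_counts = Counter(truth)
    mi = sum(
        (c / n) * log(c * n / (pred_counts[a] * truth_counts[b]))
        for (a, b), c in joint.items()
    )
    denom = 0.5 * (entropy(pred) + entropy(truth))
    # Two trivial (single-cluster) labelings are treated as identical.
    return mi / denom if denom > 0 else 1.0


def generalization_index(clusterings, truth):
    """g = max_i NMI(C^i_unseen, C*), over hypothetical candidate clusterings."""
    return max(nmi(c, truth) for c in clusterings)


# Toy example: four unseen-class samples, two true classes.
truth = [0, 0, 1, 1]
candidates = [
    [0, 1, 0, 1],  # statistically independent of the true classes
    [1, 1, 0, 0],  # matches the true classes up to relabeling
]
g = generalization_index(candidates, truth)
```

In this toy case the second candidate recovers the true classes up to a relabeling, so `g` equals 1 up to floating-point rounding. NMI is invariant to cluster label permutations, which is why it suits comparing an unsupervised clustering with ground-truth classes.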
Abstract
Generalization to unseen data is a key desideratum for deep networks, but its relation to classification accuracy is unclear. Using a minimalist vision dataset and a measure of generalizability, we show that popular networks, from deep convolutional networks (CNNs) to transformers, vary in their power to extrapolate to unseen classes both across layers and across architectures. Accuracy is not a good predictor of generalizability, and generalization varies non-monotonically with layer depth.
