What Variables Affect Out-of-Distribution Generalization in Pretrained Models?

Md Yousuf Harun; Kyungbok Lee; Jhair Gallardo; Giri Krishnan; Christopher Kanan

TL;DR

Training on high-resolution datasets with many classes greatly reduces representation compression and improves transferability. The results underscore the danger of generalizing findings from toy datasets to broader contexts.

Abstract

Embeddings produced by pre-trained deep neural networks (DNNs) are widely used; however, their efficacy for downstream tasks can vary widely. We study the factors influencing transferability and out-of-distribution (OOD) generalization of pre-trained DNN embeddings through the lens of the tunnel effect hypothesis, which is closely related to intermediate neural collapse. This hypothesis suggests that deeper DNN layers compress representations and hinder OOD generalization. Contrary to earlier work, our experiments show this is not a universal phenomenon. We comprehensively investigate the impact of DNN architecture, training data, image resolution, and augmentations on transferability. We identify that training with high-resolution datasets containing many classes greatly reduces representation compression and improves transferability. Our results emphasize the danger of generalizing findings from toy datasets to broader contexts.

Paper Structure (71 sections, 1 equation, 25 figures, 13 tables)

Introduction
Related Work
The Tunnel Effect
Learning Embeddings that Generalize
Methods
Measuring the Tunnel Effect
Variables Investigated for Their Role in OOD Generalization
Augmentation.
Number of Classes.
Number of Samples.
Resolution.
DNN Architecture Variables.
Datasets
ID Datasets.
OOD Datasets.
...and 56 more sections

Figures (25)

Figure 1: The tunnel effect. The tunnel impedes OOD generalization, which we study using linear probes trained on ID and OOD datasets for each layer. In this example, identical VGGm-17 architectures are trained on identical ID datasets, where only the resolution is changed. Probe accuracy on OOD datasets decreases once the tunnel is reached (denoted by $\medwhitestar$), where the model trained on low-resolution ($32\times 32$) images creates a longer tunnel (layers 9-16) than the one (layers 13-16) trained on higher-resolution ($224\times 224$) images. The Y-axis shows the normalized accuracy. The OOD curve is the average of 8 OOD datasets (Sec. \ref{subsec:ood_datasets}), with the standard deviation denoted with shading.
Figure 2: SHAP Results. The SHAP slope shows the individual contribution of each variable to various targets. Positive values indicate enhanced OOD generalization, and negative values indicate reduced OOD generalization.
Figure 3: Augmentation greatly reduces the tunnel effect. In (a), augmentation shifts the tunnel from layer 14 to 22, and in (b) from block 11 to 15. The OOD curve is the average of 8 OOD datasets with a shaded area indicating a 95% confidence interval. $\medwhitestar$ denotes the start of the tunnel.
Figure 4: High-resolution model does not exhibit representation compression. The t-SNE comparison between VGGm-11 models trained on low- (1st row) and high-resolution (2nd row) images of the same ID dataset (ImageNet-100) in an augmentation-free setting. Layer 8 marks the start of the tunnel in VGGm-11 trained on $32\times 32$ images whereas $224\times 224$ resolution does not create any tunnel. Layer 10 is the penultimate layer. The tunnel layers (layers 8-10) progressively compress representations for $32\times 32$ resolution whereas corresponding layers for $224\times 224$ resolution do not exhibit similar compression. For clarity, we show 5 classes from ImageNet-100 and indicate each class by a distinct color. The formation of distinct clusters in the $32\times 32$ model is indicative of representation compression and intermediate neural collapse \cite{rangamani2023feature}, which impairs OOD generalization.
Figure 5: The tunnel effect is not universal. In (a), VGGm-11 with max-pooling in all 5 stages ($\phi=0.5$) creates a tunnel (layers 7-10, gray-shaded area). In (b), the same VGGm-11 without max-pooling in the first 2 stages ($\phi=1$, called VGGm$\dag$-11) eliminates the tunnel for all OOD datasets.
...and 20 more figures
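
The per-layer probe analysis described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that linear-probe accuracies have already been computed for every layer, that "normalized accuracy" means dividing each curve by its own maximum, and that the tunnel starts at the first layer after the peak of the averaged OOD curve. The function names (`normalize`, `average_curves`, `tunnel_start`), the `tolerance` parameter, and the toy accuracy values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def normalize(curve):
    """Scale a per-layer accuracy curve so its best layer equals 1.0.

    Assumption: normalization divides by the curve's own maximum.
    """
    peak = max(curve)
    return [acc / peak for acc in curve]


def average_curves(curves):
    """Return the layer-wise mean and standard deviation across datasets."""
    per_layer = list(zip(*curves))  # regroup values by layer
    means = [mean(values) for values in per_layer]
    stds = [pstdev(values) for values in per_layer]
    return means, stds


def tunnel_start(ood_curve, tolerance=0.0):
    """Return the 1-based layer where the tunnel is assumed to begin.

    Assumed rule: the tunnel starts at the first layer after the peak of
    the OOD curve, provided accuracy at the last layer has dropped by more
    than `tolerance` relative to the peak. Returns None otherwise.
    """
    peak_index = max(range(len(ood_curve)), key=ood_curve.__getitem__)
    if peak_index == len(ood_curve) - 1:
        return None  # accuracy never declines, so no tunnel is detected
    if ood_curve[peak_index] - ood_curve[-1] <= tolerance:
        return None  # the decline is too small to count as a tunnel
    return peak_index + 2  # 0-based peak index -> 1-based next layer


# Hypothetical probe accuracies for two OOD datasets on a 6-layer model.
ood_a = [0.20, 0.35, 0.50, 0.48, 0.40, 0.30]
ood_b = [0.10, 0.25, 0.40, 0.36, 0.30, 0.20]

mean_curve, std_curve = average_curves([normalize(ood_a), normalize(ood_b)])
start_layer = tunnel_start(mean_curve)
print("assumed tunnel start layer:", start_layer)
```

Under these assumptions, a longer tunnel corresponds to an earlier `start_layer`, which is the kind of comparison the caption draws between the low-resolution and high-resolution models.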
