CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis

Xiaoxiao Sun; Xingjian Leng; Zijian Wang; Yang Yang; Zi Huang; Liang Zheng

TL;DR

This paper introduces CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways, and aims to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments.

Abstract

Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. We also discuss other fields that would benefit from CIFAR-10-W.

Paper Structure (23 sections, 16 figures, 8 tables)

Introduction
Data Collection
Task I: Model Accuracy Prediction on Unlabeled Sets
Benchmarking Setup
Benchmarking Results and Main Observations
More Results and Findings
Task II: Domain Generalization
Benchmarking Setup
Benchmarking Results and Discussions
Other Fields That Potentially Benefit from CIFAR-10-W
Conclusion
Details and More Discussions of Datasets in CIFAR-10-W
Prompts and Data Sources
More Analysis of CIFAR-10-W
Diversity of datasets in CIFAR-10-W
...and 8 more sections

Figures (16)

Figure 1: Colors, sources, statistics, and examples of CIFAR-10-W. (A) Datasets of CIFAR-10-W are collected from 8 sources: 7 search engines (e.g., Google and Bing) and the diffusion model, where numbers after each source denote the number of datasets. We also depict color and style options used in prompting search and generation. (B) Distribution of the number of images for each category in datasets searched by keywords (KW), keywords plus cartoon (KWC) and diffusion under specific color conditions. (C) Sample images from different domains are shown.
Figure 2: Correlation studies. (A) Visualizing the correlation between accuracy and prediction scores on CIFAR-10-W. We use a ResNet44 classifier and Spearman's rank correlation $\rho$. ATC-MC (left) and MA-AoL (right) are used. (B) Relationship between accuracy and accuracy prediction error (MAE, %) on the CIFAR-10-Cs (left) and CIFAR-10-W (right) testbeds. Both use the MS-AoL method.
Figure 3: (A) Variance of MAE (%) caused by 40 classifiers. We compare the variance of different AccP methods on CIFAR-10-Cs and CIFAR-10-W. (B) Impact of the classifier training set: CIFAR-10 vs. CIFAR-10-F. (C) Impact of test category removal: the removed categories, deleted one or two at a time, are listed at the bottom. A positive change in MAE indicates worse performance and vice versa.
Figure 4: (A) Impact of test set size on AccP methods and (B) average and standard deviation of MAE values for each of the 40 classifiers across 13 AccP methods. In (A), the test set is gradually reduced from the full dataset to 100 instances, and the performance of the methods is shown on both CIFAR-10-Cs and CIFAR-10-W. In (B), the easiest and hardest models to evaluate are indicated by green and red points, respectively. The ResNet44 classifier trained on CIFAR-10 is used.
Figure 5: (A) Impact of increasing the number of source domains on domain generalization. The ResNet-18 classifier is trained using the domain generalization technique SD with our searched training sets as the source domains. The density plot on the y-axis illustrates the density of the test set at various levels of improvement. On the x-axis, the density plot shows the distribution of accuracies achieved by the baseline method ERM on CIFAR-10-W datasets. (B) Effectiveness of accuracy prediction methods (nuclear norm and FD) on CIFAR-10-W. We evaluate the performance using the ResNet-18 model trained with two different approaches: the normally trained model (top) and the model trained with the domain generalization technique SD (bottom).
...and 11 more figures
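
The captions for Figures 2 to 4 report two quantities for accuracy prediction: Spearman's rank correlation $\rho$ between true accuracy and a prediction score, and the mean absolute error (MAE, %) between true and predicted accuracy. The sketch below shows one way these could be computed over a set of test domains. It is an illustration only, not the authors' evaluation code: the function names and all numbers are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mae_percent(true_acc, pred_acc):
    """Mean absolute error between true and predicted accuracy (both in %)."""
    return mean(abs(t - p) for t, p in zip(true_acc, pred_acc))


# Hypothetical per-domain values, made up for illustration only.
true_acc = [92.0, 75.5, 60.2, 48.9]   # classifier accuracy on each test set (%)
scores = [0.95, 0.80, 0.70, 0.55]     # unsupervised prediction score per test set
pred_acc = [90.0, 78.5, 58.2, 50.9]   # accuracy predicted from the scores (%)

rho = spearman_rho(true_acc, scores)
mae = mae_percent(true_acc, pred_acc)
print(f"Spearman rho = {rho:.3f}, MAE = {mae:.2f}%")
```

A high $\rho$ indicates that the score orders the test sets in the same way as their true accuracy, while MAE measures how far the predicted accuracies are from the true ones in absolute terms.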
