
Insights from the Use of Previously Unseen Neural Architecture Search Datasets

Rob Geada, David Towers, Matthew Forshaw, Amir Atapour-Abarghouei, A. Stephen McGough

TL;DR

This paper tackles the problem of NAS generalization by introducing eight unseen NAS datasets spanning Type-1 and Type-2 tasks to stress-test models on problems unknown at training time. It benchmarks standard CNN baselines and NAS methods (PC-DARTS, DrNAS, Bonsai-Net) as well as random search, revealing mixed generalization performance across datasets. The key contributions are the dataset descriptions, splits, and baseline results, plus a discussion of generalization limits and the need for more unseen benchmarks. Overall, the work argues for diversifying NAS evaluation beyond CIFAR-10 and ImageNet to better assess real-world applicability and generalization potential.

Abstract

The boundless possibility of neural networks which can be used to solve a problem -- each with different performance -- leads to a situation where a Deep Learning expert is required to identify the best neural network. This goes against the hope of removing the need for experts. Neural Architecture Search (NAS) offers a solution to this by automatically identifying the best architecture. However, to date, NAS work has focused on a small set of datasets which we argue are not representative of real-world problems. We introduce eight new datasets created for a series of NAS Challenges: AddNIST, Language, MultNIST, CIFARTile, Gutenberg, Isabella, GeoClassing, and Chesseract. These datasets and challenges are developed to direct attention to issues in NAS development and to encourage authors to consider how their models will perform on datasets unknown to them at development time. We present experimentation using standard Deep Learning methods as well as the best results from challenge participants.

Paper Structure (21 sections, 3 figures, 3 tables)


Figures (3)

  • Figure 1: Left: AddNIST image - the sum of the channels adds up to 15 (r = 2, g = 9, b = 4), implying a label of 14. Middle: MultNIST image - the product of the channels equals 35 (r = 5, g = 7, b = 1), which implies a label of 5 (35 % 10 = 5). Right: CIFARTile image - two deer, one aeroplane, and one horse, meaning there are three unique labels among the sub-images, which equates to a final label of 2.
  • Figure 2: Top: An example of a Language image with a readable axis included. The right-hand axis contains the four six-letter words "Uvulas", "Minted", "Suckle", and "Debits", which are all English words and are therefore given the label 0. Bottom: An example Gutenberg image; the words "their", "spring", and "their" have been encoded from one of Shakespeare's works, which gives the label 4.
  • Figure 3: Left: An example of an Isabella image generated from a piece of music labelled as "20th Century" (0). Middle: An example from the GeoClassing dataset showing a photo taken over Portugal (9). Right: An example rendering of a board position in the Chesseract dataset in which white eventually goes on to win, so the position is given the label White Wins (0).
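
The label arithmetic in the Figure 1 caption can be illustrated with a minimal Python sketch. The function names are hypothetical, and two of the rules are inferred from the caption's single worked examples rather than stated in it: that AddNIST uses the channel sum minus one (15 giving 14), and that CIFARTile uses the number of unique classes minus one (3 giving 2). Only the MultNIST rule (product modulo 10) is given directly in the caption.

```python
def addnist_label(r: int, g: int, b: int) -> int:
    """Assumed rule: label is the sum of the three channel digits minus one."""
    return r + g + b - 1


def multnist_label(r: int, g: int, b: int) -> int:
    """Rule from the caption: label is the product of the channel digits modulo 10."""
    return (r * g * b) % 10


def cifartile_label(tile_classes: list[str]) -> int:
    """Assumed rule: label is the number of unique classes among the tiles minus one."""
    return len(set(tile_classes)) - 1


if __name__ == "__main__":
    # Values taken from the Figure 1 caption.
    print(addnist_label(2, 9, 4))
    print(multnist_label(5, 7, 1))
    print(cifartile_label(["deer", "deer", "aeroplane", "horse"]))
```

With the caption's example values, these functions reproduce the stated labels of 14, 5, and 2 respectively.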