Generalisation in humans and deep neural networks

Robert Geirhos, Carlos R. Medina Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, Felix A. Wichmann

TL;DR

The paper investigates how humans and state-of-the-art CNNs generalise to a broad suite of image distortions, revealing that humans exhibit markedly greater robustness and more uniform error distributions under degraded signals. By first evaluating pre-trained networks and then training networks directly on distorted images, the study shows that distortion-specific training yields high performance on the trained distortion but fails to generalise to unseen distortions, highlighting a fundamental generalisation gap under distribution shift. The authors introduce a large, carefully controlled dataset of 82,880 human psychophysical trials and a 16-class ImageNet mapping to enable fair human-DNN comparisons and lifelong robustness benchmarking. The findings suggest that improving robustness will require approaches beyond standard data augmentation, potentially incorporating perceptual normalization and shape priors, and they open the way for closer exchange between research on human vision and research on machine perception.

Abstract

We compare the robustness of humans and current convolutional deep neural networks (DNNs) on object recognition under twelve different types of image degradations. First, using three well-known DNNs (ResNet-152, VGG-19, GoogLeNet) we find the human visual system to be more robust to nearly all of the tested image manipulations, and we observe progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker. Secondly, we show that DNNs trained directly on distorted images consistently surpass human performance on the exact distortion types they were trained on, yet they display extremely poor generalisation abilities when tested on other distortion types. For example, training on salt-and-pepper noise does not imply robustness on uniform white noise and vice versa. Thus, changes in the noise distribution between training and testing constitute a crucial challenge to deep learning vision systems that can be systematically addressed in a lifelong machine learning approach. Our new dataset consisting of 83K carefully measured human psychophysical trials provides a useful reference for lifelong robustness against image degradations set by the human visual system.
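To make the contrast between the two noise types in the abstract concrete, here is a minimal sketch of how additive uniform noise and salt-and-pepper noise differ at the pixel level. It is illustrative only: the function names, the flat list of greyscale values in [0, 1], and the parameters `width` and `p` are assumptions, not the paper's stimulus-generation code or its exact parametrisation.

```python
import random


def add_uniform_noise(pixels, width, rng):
    """Add noise drawn from U(-width, width) to every pixel, clipped to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.uniform(-width, width))) for v in pixels]


def add_salt_and_pepper(pixels, p, rng):
    """With probability p, replace a pixel by white (1.0) or black (0.0)."""
    noisy = []
    for v in pixels:
        if rng.random() < p:
            # Corrupted pixel: salt or pepper with equal probability.
            noisy.append(1.0 if rng.random() < 0.5 else 0.0)
        else:
            # Untouched pixel keeps its original value.
            noisy.append(v)
    return noisy


rng = random.Random(0)      # fixed seed so the example is repeatable
grey = [0.5] * 10000        # hypothetical mid-grey image, flattened
uniform_noisy = add_uniform_noise(grey, 0.2, rng)
sp_noisy = add_salt_and_pepper(grey, 0.2, rng)
```

In this sketch uniform noise perturbs every pixel by a small amount, whereas salt-and-pepper noise leaves most pixels untouched and drives a fraction of them to the extremes. The pixel statistics therefore differ substantially even when both images look similarly "noisy" to a human observer.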

Paper Structure

This paper contains 11 sections, 6 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Classification performance of ResNet-50 trained from scratch on (potentially distorted) ImageNet images. (a) Classification performance when trained on standard colour images and tested on colour images is close to perfect (better than human observers). (b) Likewise, when trained and tested on images with additive uniform noise, performance is super-human. (c) Striking generalisation failure: When trained on images with salt-and-pepper noise and tested on images with uniform noise, performance is at chance level, even though the two noise types do not look very different to human observers.
  • Figure 2: Example stimulus image of class bird across all distortion types. From left to right, image manipulations are: colour (undistorted), greyscale, low contrast, high-pass, low-pass (blurring), phase noise, power equalisation. Bottom row: opponent colour, rotation, Eidolon I, II and III, additive uniform noise, salt-and-pepper noise. Example stimulus images across all used distortion levels are available in the supplementary material.
  • Figure 3: Classification accuracy and response distribution entropy for GoogLeNet, VGG-19 and ResNet-152 as well as for human observers. 'Entropy' indicates the Shannon entropy of the response/decision distribution (16 classes). Here it is a measure of bias towards certain categories: using a test dataset that is balanced with respect to the number of images per category, responding equally frequently with all 16 categories elicits the maximum possible entropy of four bits. If a network or observer prefers some categories over others, entropy decreases (down to zero bits in the extreme case of responding with one particular category all the time, irrespective of the ground truth category). Human 'error bars' indicate the full range of results across participants. Image manipulations are explained in Section \ref{meth:image_manipulations} and visualised in Figures \ref{fig:stimuli_noise_contrast}, \ref{fig:stimuli_lowpass_highpass}, \ref{fig:stimuli_eidolon_I_II}, \ref{fig:stimuli_eidolon_III_phase_scrambling_false_colour_power_equalisation} and \ref{fig:stimuli_salt_and_pepper_noise}.
  • Figure 4: Classification accuracy (in percent) for networks with potentially distorted training data. Rows show different test conditions at an intermediate difficulty (exact condition indicated in brackets, units as in Figure \ref{fig:results_accuracy_entropy}). Columns correspond to differently trained networks (leftmost column: human observers for comparison; no human data available for salt-and-pepper noise). All of the networks were trained from scratch on (a potentially manipulated version of) 16-class-ImageNet. Manipulations included in the training data are indicated by a red rectangle; additionally 'greyscale' is underlined if it was part of the training data because a certain distortion encompasses greyscale images at full contrast. Models A1 to A9: ResNet-50 trained on a single distortion (100 epochs). Models B1 to B9: ResNet-50 trained on uniform noise plus one other distortion (200 epochs). Models C1 & C2: ResNet-50 trained on all but one distortion (200 epochs). Chance performance is at $\frac{1}{16}=6.25\%$ accuracy.
  • Figure 5: Schematic of a trial. After the presentation of a central fixation square (300 ms), the image was visible for 200 ms, followed immediately by a noise mask with 1/f spectrum (200 ms). Then, a response screen appeared for 1500 ms, during which the observer clicked on a category. Note that we increased the contrast of the noise mask in this figure for better visibility when printed. Categories row-wise from top to bottom: knife, bicycle, bear, truck, airplane, clock, boat, car, keyboard, oven, cat, bird, elephant, chair, bottle, dog. The icons are a modified version of the ones from the MS COCO website (http://mscoco.org/explore/).
  • ...and 9 more figures
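
The response distribution entropy described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming responses are recorded as category labels; the function name `response_entropy_bits` and the example data are hypothetical and not taken from the paper's analysis code.

```python
import math
from collections import Counter


def response_entropy_bits(responses):
    """Shannon entropy (in bits) of the empirical distribution of responses."""
    if not responses:
        raise ValueError("need at least one response")
    total = len(responses)
    entropy = 0.0
    for count in Counter(responses).values():
        p = count / total            # relative frequency of this category
        entropy -= p * math.log2(p)  # categories never chosen contribute 0
    return entropy


# Hypothetical observer that uses all 16 categories equally often.
balanced = list(range(16)) * 10
# Hypothetical observer that always answers with the same category.
biased = [3] * 160

print(response_entropy_bits(balanced))  # 4.0 bits, the maximum for 16 classes
print(response_entropy_bits(biased))    # 0 bits, a fully biased observer
```

The two extremes match the caption: equal use of all 16 categories gives $\log_2 16 = 4$ bits, and always choosing one category gives zero bits, regardless of whether the responses are correct.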