Comparing supervised learning dynamics: Deep neural networks match human data efficiency but show a generalisation lag

Lukas S. Huber, Fred W. Mast, Felix A. Wichmann

TL;DR

This study compares how humans and deep neural networks (DNNs) acquire representations of novel objects under closely matched supervised learning conditions. It introduces a psychophysical paradigm with six epochs of training and test trials on naturalistic 3D object stimuli, tracking learning from a starting point without prior knowledge of the categories and comparing training performance, test performance, data efficiency, and generalisation. The findings show that DNNs can match human data efficiency but exhibit a pronounced generalisation lag, whereas humans generalise immediately, without first learning training-set-specific information that is only later transferred to novel data. This suggests that achieving representational alignment requires promoting immediate generalisation in neural networks, not just improving data efficiency, with implications for designing learning regimes that better mimic human cognition.

Abstract

Recent research has seen many behavioral comparisons between humans and deep neural networks (DNNs) in the domain of image classification. Often, comparison studies focus on the end-result of the learning process by measuring and comparing the similarities in the representations of object categories once they have been formed. However, the process of how these representations emerge -- that is, the behavioral changes and intermediate stages observed during the acquisition -- is less often directly and empirically compared. Here we report a detailed investigation of the learning dynamics in human observers and various classic and state-of-the-art DNNs. We develop a constrained supervised learning environment to align learning-relevant conditions such as starting point, input modality, available input data and the feedback provided. Across the whole learning process we evaluate and compare how well learned representations can be generalized to previously unseen test data. Comparisons across the entire learning process indicate that DNNs demonstrate a level of data efficiency comparable to human learners, challenging some prevailing assumptions in the field. However, our results also reveal representational differences: while DNNs' learning is characterized by a pronounced generalisation lag, humans appear to immediately acquire generalizable representations without a preliminary phase of learning training set-specific information that is only later transferred to novel data.

Paper Structure (22 sections, 1 equation, 8 figures, 2 tables)

This paper contains 22 sections, 1 equation, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Naturalistic novel objects were generated to serve as the basis for the creation of stimuli employed in the learning task. (a) Taxonomy diagram illustrating the genesis of novel objects, originating from an icosahedron and spanning two generations, with the second generation serving as the basis for our stimuli. Objects sharing the same parent in the generation process are assigned to the same category. (b) Displayed is a set of twelve distinct renderings of an object from the category "Puns", showcasing 30-degree pitch rotations (x-axis) from its initial position (indicated by "IV"). Panel (c) shows a blueprint of the learning task. There are six epochs each consisting of 36 training trials (green) and 51 test trials (blue). All test sets contain different images to ensure novelty. During training, participants received corrective feedback mimicking supervised learning in DNN training. Depending on whether the given response was correct or not, the chosen category would either light up in green or red for one second. If the response was incorrect, this would be followed by a screen displaying the correct response for one second. In test trials, no feedback is provided, aiming to restrict supervised learning to the training phase.
  • Figure 2: Inter-individual differences in human learning dynamics. For each participant we show moving averages of classification accuracy across the whole training (different trajectories mark different window sizes). Early learning is separately displayed in the small plots (purple trajectories). In these plots dashed lines mark binomial (Clopper-Pearson) confidence intervals around chance performance. We reason that if the trajectory does not exceed the upper bound of this confidence interval, this indicates that the observer does start training without any prior knowledge. The larger plots illustrate the evolution of training performance throughout the entire training period, as depicted by moving averages of varying sizes (different shades of pink). Here, dashed lines mark a confidence interval around the mean performance of individual observers. If the performance was significantly below the mean at the start, and is significantly higher at the end of training, this indicates that learning has occurred. Black circles with white crosses mark excluded participants because they either did not start learning from scratch (Participant 4), or did not show any indication of learning (Participants 3--6, and 11).
  • Figure 3: Observed learning dynamics indicate that both humans and DNNs learn novel generalizable representations from limited amounts of training data. However, while humans immediately form generalizable representations, DNNs show a pronounced generalisation lag. Different plots show training (dashed lines) and test (solid lines) performance trajectories in terms of classification accuracy for all classic CNNs (teal) and SOTA models (yellow). The performance of models is averaged across 20 fine-tuning runs (see Appendix \ref{app:models} for individual runs). In each plot, model performance is contrasted with the mean performance of human observers (pink). Detailed information on the learning dynamics of individual participants can be found in Figure 2 (training) and in Appendix \ref{app:humans} (test). Shaded areas designate binomial confidence intervals (Clopper-Pearson) around chance performance for training sets; accuracies exceeding the upper bound suggest performance significantly above chance.
  • Figure 4: DNNs are not inherently less data efficient compared to humans. Here we quantify data efficiency as the mean test accuracy gain per training image across epochs. All observers (humans, classic CNNs, SOTA models) show similar data efficiency across learning, challenging the prevailing assumption that DNNs are inherently less data efficient than humans.
  • Figure 5: Better performance does not imply lower generalisation lag. Generalisation lag ($\Delta G$) is plotted as a function of ImageNet accuracy and the number of parameters optimised during training (circle area). Top-1 ImageNet accuracy for humans is somewhat difficult to estimate from available studies since different measures are reported: 92.9--99% top-1 accuracy for entry-level categories (dodge2017study; geirhos2018imagenet), 84.9% top-5 accuracy on ImageNet-1k (russakovsky2015imagenet), and up to 97.3% multilabel accuracy (shankar2020evaluating). Therefore, we designate human top-1 ImageNet accuracy with a rather conservative lower and upper bound of 80--90%.
  • ...and 3 more figures
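
The Clopper-Pearson band around chance performance mentioned in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal standard-library illustration, not the authors' analysis code: the function names, the 95% level, and the three-category example (chance = 1/3) are assumptions, since the captions state neither the number of categories nor the confidence level. Only the 36 training trials per epoch come from the Figure 1 caption.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_root(f, lo=0.0, hi=1.0, iters=60):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect_root(
        lambda p: binom_cdf(k - 1, n, p) - (1 - alpha / 2)
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect_root(
        lambda p: binom_cdf(k, n, p) - alpha / 2
    )
    return lower, upper


def chance_band(n_trials, n_categories, alpha=0.05):
    """Interval for an observer scoring exactly at chance (assumed definition)."""
    k_chance = round(n_trials / n_categories)
    return clopper_pearson(k_chance, n_trials, alpha)


if __name__ == "__main__":
    # 36 training trials per epoch (Figure 1); 3 categories is an assumption.
    low, high = chance_band(n_trials=36, n_categories=3)
    print(f"chance band: [{low:.3f}, {high:.3f}]")
```

Under these assumptions, an accuracy above the returned upper bound would count as significantly above chance in the sense used in the captions.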
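
The data-efficiency measure described in the Figure 4 caption (mean test accuracy gain per training image across epochs) might be computed along the following lines. This is a hedged sketch of one plausible reading of that sentence, not the paper's exact formula: the function name `data_efficiency`, the use of a pre-training baseline accuracy for the first epoch, and the example accuracies are all hypothetical.

```python
from statistics import mean


def data_efficiency(test_acc_per_epoch, images_per_epoch, baseline_acc):
    """Mean test-accuracy gain per training image across epochs.

    Assumed reading: the gain of an epoch is its test accuracy minus the test
    accuracy of the previous epoch (or a pre-training baseline for the first
    epoch), divided by the number of training images shown in that epoch.
    """
    if not test_acc_per_epoch:
        raise ValueError("need at least one epoch of test accuracy")
    previous = [baseline_acc] + list(test_acc_per_epoch[:-1])
    gains = [
        (current - prev) / images_per_epoch
        for prev, current in zip(previous, test_acc_per_epoch)
    ]
    return mean(gains)


if __name__ == "__main__":
    # Hypothetical accuracies for six epochs; 36 training images per epoch
    # follows the Figure 1 caption, the chance baseline of 1/3 is an assumption.
    accuracies = [0.45, 0.55, 0.62, 0.68, 0.72, 0.75]
    print(data_efficiency(accuracies, images_per_epoch=36, baseline_acc=1 / 3))
```

With equal epoch sizes this reduces to the total accuracy gain divided by the total number of training images, so learners that see the same images are compared by how far each one moved from its starting accuracy.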