Artificial Neural Nets and the Representation of Human Concepts

Timo Freiesleben

TL;DR

This work questions the prevailing claim that artificial neural networks store and operate on human concepts within individual units. By introducing coactivation and functional role as criteria for concept representation, it surveys evidence from transfer learning, TCAV, and adversarial examples to assess whether ANNs learn human concepts and how they store them. The author concludes that ANNs can perform complex tasks and may learn concepts to do so, but that the evidence for storage of human concepts in single units is weak and often mixed, pointing instead toward distributed representations and context-dependent features, including non-human concepts. The discussion highlights methodological risks, advocates for falsifiable hypotheses, and urges exploration beyond supervised learning to better understand when and how concepts emerge in AI systems and what this implies for interpretability and safety.

Abstract

What do artificial neural networks (ANNs) learn? The machine learning (ML) community shares the narrative that ANNs must develop abstract human concepts to perform complex tasks. Some go even further and believe that these concepts are stored in individual units of the network. Based on current research, I systematically investigate the assumptions underlying this narrative. I conclude that ANNs are indeed capable of performing complex prediction tasks, and that they may learn human and non-human concepts to do so. However, evidence indicates that ANNs do not represent these concepts in individual units.

Paper Structure (31 sections, 3 figures)

Figures (3)

  • Figure 1: An illustration of transfer learning by Mukhlif et al. (2023). In this case, filters of a CNN are transferred from one task to another. The fact that this strategy works means that models have learned general and reliable concepts.
  • Figure 2: An illustration of an adversarial example by Goodfellow et al. (2014). An image classification model misclassifies a panda as a gibbon after noise is added to the image. This indicates that the model's prediction relies on some features that humans do not rely on.
  • Figure 3: Olah et al. (2017, 2020) applied activation maximization three times to a single unit of an ANN, with slightly different initializations and objectives. Humans may find some semantically meaningful structures in the images (e.g., cat heads, car fronts, and bee bodies), but this works only for some units, and the structures can conflict, as in this case. Olah et al. (2017) call this neuron polysemantic, but how can we be sure that there is any semantics at all, given these non-representative examples?
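
The adversarial example in Figure 2 was produced with the fast gradient sign method of Goodfellow et al. (2014). The sketch below illustrates the same idea on a deliberately tiny stand-in: a logistic-regression classifier rather than an image network. The weights, the input, and the value of `epsilon` are hypothetical and chosen only so that the effect is visible; nothing here reproduces the original experiment.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 under a toy logistic-regression model."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(logit)


def fgsm_perturb(weights, bias, x, label, epsilon):
    """One fast-gradient-sign step: shift every input coordinate by
    epsilon in the direction that increases the cross-entropy loss."""
    p = predict(weights, bias, x)
    # For logistic regression, d(loss)/d(x_i) = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


# Hypothetical 100-dimensional model and input (assumed values).
weights = [0.5, -0.5] * 50
bias = 0.0
x = [0.1, -0.1] * 50  # confidently classified as class 1
label = 1

x_adv = fgsm_perturb(weights, bias, x, label, epsilon=0.15)

print("clean prediction:      ", predict(weights, bias, x))
print("adversarial prediction:", predict(weights, bias, x_adv))
print("largest change per coordinate:",
      max(abs(a - b) for a, b in zip(x, x_adv)))
```

No coordinate moves by more than `epsilon`, yet the predicted class flips, because many small changes aligned with the gradient add up across dimensions. That accumulation is one reason such perturbations can stay nearly invisible to humans in high-dimensional image inputs.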