A Number Sense as an Emergent Property of the Manipulating Brain

Neehar Kondapaneni, Pietro Perona

TL;DR

In a model where spontaneous, undirected manipulation of small objects trains perception to predict the resulting scene changes, the ability to estimate the number of objects in the scene emerges. The authors conclude that important aspects of a facility with numbers and quantities may be learned without explicit teacher supervision.

Abstract

The ability to understand and manipulate numbers and quantities emerges during childhood, but the mechanism through which humans acquire and develop this ability is still poorly understood. We explore this question through a model, assuming that the learner is able to pick up and place small objects from, and to, locations of its choosing, and will spontaneously engage in such undirected manipulation. We further assume that the learner's visual system will monitor the changing arrangements of objects in the scene and will learn to predict the effects of each action by comparing perception with a supervisory signal from the motor system. We model perception using standard deep networks for feature extraction and classification, and gradient descent learning. Our main finding is that, from learning the task of action prediction, an unexpected image representation emerges exhibiting regularities that foreshadow the perception and representation of numbers and quantity. These include distinct categories for zero and the first few natural numbers, a strict ordering of the numbers, and a one-dimensional signal that correlates with numerical quantity. As a result, our model acquires the ability to estimate numerosity, i.e. the number of objects in the scene, as well as subitization, i.e. the ability to recognize at a glance the exact number of objects in small scenes. Remarkably, subitization and numerosity estimation extrapolate to scenes containing many objects, far beyond the three objects used during training. We conclude that important aspects of a facility with numbers and quantities may be learned with supervision from a simple pre-training task. Our observations suggest that cross-modal learning is a powerful learning mechanism that may be harnessed in artificial intelligence.
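The training signal described in the abstract can be illustrated with a minimal sketch, assuming a toy scene representation. Here a scene is a list of object positions, and each undirected action produces a (before, after, action) triple in which the action label plays the role of the supervisory signal from the motor system. The names `ACTIONS`, `step`, and `make_triples`, the unit-square positions, and the restriction to two action labels are assumptions made for illustration; they are not taken from the paper, which works with rendered images rather than coordinate lists.

```python
import random

# Hypothetical action labels: place one object, or pick one object up.
ACTIONS = ("put", "take")
# The abstract states that training scenes contain at most three objects.
MAX_OBJECTS = 3


def random_position(rng):
    """Return a random (x, y) location in the unit square (toy assumption)."""
    return (rng.random(), rng.random())


def step(scene, rng):
    """Apply one undirected action; return (next_scene, action_label)."""
    legal = []
    if len(scene) < MAX_OBJECTS:
        legal.append("put")
    if scene:
        legal.append("take")
    action = rng.choice(legal)

    next_scene = list(scene)  # copy, so the "before" scene is left untouched
    if action == "put":
        next_scene.append(random_position(rng))
    else:
        next_scene.pop(rng.randrange(len(next_scene)))
    return next_scene, action


def make_triples(n_steps, seed=0):
    """Generate (scene_t, scene_t_plus_1, action) triples from random play."""
    rng = random.Random(seed)
    scene, triples = [], []
    for _ in range(n_steps):
        next_scene, action = step(scene, rng)
        triples.append((scene, next_scene, action))
        scene = next_scene
    return triples
```

In this sketch the learner never receives a count as a label. The only supervision is which action was taken, so any representation of quantity would have to emerge as a by-product of predicting that action from the two scenes.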

Paper Structure

This paper contains 29 sections, 4 figures.

Figures (4)

  • Figure 1: Schematics of our model. (A) (Left-to-right) A sequence of actions modifies the visual scene over time. (B) (Bottom-to-top) The scene changes as a result of manipulation. The images $x_t$ and $x_{t+1}$ of the scene before and after manipulation are mapped by perception into representations $z_t$ and $z_{t+1}$. These are compared by a classifier to predict which action took place. Learning monitors the error between the predicted action and a signal from the motor system representing the actual action, and simultaneously updates the weights of both perception and the classifier to increase prediction accuracy. (C) (Bottom-to-top) Our model of perception is a hybrid neural network composed of the concatenation of a convolutional neural network (CNN) with a fully connected network (FCN 1). The classifier is implemented by a fully connected network (FCN 2), which compares the two representations $z_t$ and $z_{t+1}$. The two perception networks are actually the same network operating on distinct images; their parameters are therefore identical and learned simultaneously in a Siamese network configuration [bromley1994signature]. Details of the models are given in Fig. \ref{fig:network_details}.
  • Figure 3: Action classification performance. The network accurately classifies actions up to the training limit of three objects, regardless of the statistics of the data (the x-axis indicates the number of objects in the scene before the action takes place). Error increases when the number of objects in the test images exceeds the number of objects in the training set. 95% Bayesian confidence intervals are shown by the shaded areas ($272 \leq N \leq 386$). The gray region highlights test cases where the number of objects exceeds the number in the training set. The dashed red line indicates chance level.
  • Figure 4: The embedding space for Model B. To explore the structure of the embedding space, we generated a dataset with $\left\{ 0 \dots 30 \right\}$ objects, extending the number of objects far beyond the limit of 3 objects in the training task. Each image in the dataset was passed through Model B and the output (the internal representation/embedding) of the image is shown. See Fig. \ref{fig:embedding_space_a} for Model A. (A) Each dot indicates an image embedding, and the embeddings happen to be arranged along a line. The number of objects in each image is color coded. The smooth gradation of the color suggests that the embeddings are arranged monotonically with respect to the number of objects in the corresponding image. The inset shows that the embeddings of the images that contain only a few objects are arranged along the line into "islands". (B) We apply an unsupervised clustering algorithm to the embeddings. Each cluster that is discovered is denoted by a specific color. The cluster X, denoted by black crosses, indicates points that the clustering algorithm excluded as outliers. (C) The confusion matrix shows that the clusters found by the clustering algorithm correspond to numbers. Images containing 0 to 6 objects are neatly separated into individual clusters; beyond that, images are collected into a large group that is not in one-to-one correspondence with the number of objects in the image. The color scale is logarithmic (base 10).
  • Figure 5: Relative and absolute estimation of quantity. (A) Two images may be compared for quantity [burr2008visual] by computing their embeddings and observing their positions along our model's embedding line: the image that is farther along the line is predicted to contain more objects. Here images containing a test number of objects (see three examples above containing N=12, 16 and 20 objects) are compared with images containing the reference number of objects (vertical orange dashed line, N=16). The number of objects in the test image is plotted along the x-axis and the proportion of comparisons that result in a "more" response is plotted on the y-axis (blue line). Human data from 10 subjects [maldonado2020adaptation] is plotted in green. (B) The positions of images in the embedding space fall along a straight line that starts at 0 and continues monotonically with an increasing number of objects. Thus, the position of an image along the embedding line is an estimate of the number of objects in the scene. Here we demonstrate the outputs of such a model, where we rescale the embedding coordinate (an arbitrary unit) so that one unit of distance matches the distance between the "zero" and the "one" clusters. The y-axis represents such perceived numerosity, which is not necessarily an integer value. The red line indicates perfect prediction. Each violin plot (light blue) indicates the distribution of perceived numerosities for a given ground-truth number of objects. The width of the distributions for the higher counts indicates that perception is subject to errors. There is a slight underestimation bias for higher numbers, consistent with that seen in humans [izard2008calibrating, krueger1982single]. In fact, Krueger shows that human numerosity judgements (on images with 20 to 400 objects) follow a power function with an exponent of $0.83 \pm 0.2$. The green line and its shadow depict the range of human numerosity predictions on the same task. The orange lines are power function fits for seven models trained in the same fashion as Model B with different random initializations.
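
The quantity estimates described in the Figure 5 caption can be sketched as follows, assuming each image has already been reduced to a single coordinate along the embedding line. The sketch rescales that coordinate so that the distance between the "zero" and "one" clusters equals one unit, compares two images by their rescaled positions, and fits a power function by least squares in log-log space. The function names, the use of cluster means as anchors, and the log-log fitting procedure are assumptions made for illustration; the caption does not specify how the authors computed the fits.

```python
import math
from statistics import mean


def perceived_numerosity(coord, zero_coords, one_coords):
    """Rescale a 1-D embedding coordinate to units of 'zero'-to-'one' distance.

    zero_coords / one_coords: coordinates of images with 0 and 1 objects.
    Using the cluster means as anchors is an assumption of this sketch.
    """
    origin = mean(zero_coords)
    unit = mean(one_coords) - origin
    return (coord - origin) / unit


def more_than(test_coord, reference_coord, zero_coords, one_coords):
    """Return True if the test image lies farther along the line than the reference."""
    test = perceived_numerosity(test_coord, zero_coords, one_coords)
    reference = perceived_numerosity(reference_coord, zero_coords, one_coords)
    return test > reference


def fit_power_law(true_counts, perceived):
    """Fit perceived = a * n**b by ordinary least squares on log-transformed data.

    Requires strictly positive counts and perceived values.
    """
    xs = [math.log(n) for n in true_counts]
    ys = [math.log(p) for p in perceived]
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = math.exp(my - b * mx)
    return a, b
```

Because the comparison is made on rescaled values, it does not depend on which direction the raw embedding axis happens to point. An exponent `b` below 1 from `fit_power_law` would correspond to the underestimation at higher counts that the caption describes.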