Concepts Learned Visually by Infants Can Contribute to Visual Learning and Understanding in AI Models

Shify Treger, Shimon Ullman

Abstract

Early in development, infants learn to extract surprisingly complex aspects of visual scenes. This early learning comes with an initial understanding of the extracted concepts, including their implications and causality, and how they can be used to predict likely future events. In many cases, this learning is obtained with little or no supervision, and from relatively few examples compared to current network models. Empirical studies of visual perception in early development have shown that, in the domain of objects and human-object interactions, early-acquired concepts are often used in the process of learning additional, more complex concepts. In the current work, we model how early-acquired concepts are used in the learning of subsequent concepts, and compare the results with standard deep network modeling. We focus in particular on the use of the concepts of animacy and goal attribution in learning to predict future events in dynamic visual scenes. We show that the use of early concepts in the learning of new concepts leads to better learning (higher accuracy) and more efficient learning (requiring less data), and that the combination of early and new concepts shapes the representation of the concepts acquired by the model and improves its generalization. We further compare advanced vision-language models to a human study in a task that requires an understanding of the behavior of animate vs. inanimate agents, with results supporting the contribution of early concepts to visual understanding. Finally, we briefly discuss the possible benefits of incorporating aspects of human-like visual learning into computer vision models.

Paper Structure

This paper contains 44 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Input data for Experiment 1: The model receives a sequence of three images and predicts the location ('left' or 'right') of the actor in the next, unseen step of the sequence. Note that the objects switch locations in the third step. Top: animate actors; Bottom: inanimate actors. The person follows the object, whereas the suitcase goes to the previous location.
  • Figure 2: Two-step process of the cognitive and naive models: First, scene representations are created, and then predictions are made. Concepts specific only to the cognitive model are shown in bold and purple.
  • Figure 3: Experiment 1: Prediction Results. Average test accuracy of the naive and cognitive networks for both the small and large datasets, with the Standard Error of the Mean (SEM) included.
  • Figure 4: Input Data for Experiments 2 and 3: The five rightmost frames are used in Experiment 2, while the full seven frames are used in Experiment 3.
  • Figure 5: Experiment 2.1: Generalization to New Actors Results. Average test accuracy of the naive and cognitive networks on Task 1 (T1), retraining on Task 1 (T1-T1), and retraining on Task 2 (T1-T2), with SEM included.
  • ...and 7 more figures
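
Several of the captions above report average test accuracy together with the Standard Error of the Mean (SEM). As a minimal sketch of that summary statistic, the snippet below computes the mean and SEM over a list of per-run accuracies. The function name `mean_and_sem`, the use of the sample standard deviation (n - 1 denominator), and the example values are assumptions made for illustration; they are not taken from the paper.

```python
import math
import statistics


def mean_and_sem(accuracies):
    """Return (mean, SEM) for a list of per-run test accuracies.

    Assumption: SEM is the sample standard deviation (n - 1 denominator)
    divided by sqrt(n); the paper's exact convention is not stated here.
    """
    n = len(accuracies)
    if n < 2:
        raise ValueError("at least two runs are needed to estimate the SEM")
    mean = statistics.fmean(accuracies)
    sem = statistics.stdev(accuracies) / math.sqrt(n)
    return mean, sem


# Hypothetical accuracies from three training runs (illustrative only).
runs = [0.8, 0.9, 1.0]
mean, sem = mean_and_sem(runs)
print(f"accuracy = {mean:.3f} +/- {sem:.3f} (SEM, n={len(runs)})")
```

Reporting the SEM rather than the raw standard deviation indicates how precisely the mean accuracy is estimated across runs, which is what matters when comparing the average performance of two models.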