Development of Compositionality and Generalization through Interactive Learning of Language and Action of Robots

Prasanna Vijayaraghavan, Jeffrey Frederic Queisser, Sergio Verduzco Flores, Jun Tani

TL;DR

A brain-inspired neural network model is proposed that integrates vision, proprioception, and language within a framework of predictive coding and active inference, based on the free-energy principle. Experiments with the model show that generalization to unlearned verb-noun compositions is significantly enhanced when the variety of task compositions used in training is increased.

Abstract

Humans excel at applying learned behavior to unlearned situations. A crucial component of this generalization behavior is our ability to compose/decompose a whole into reusable parts, an attribute known as compositionality. One of the fundamental questions in robotics concerns this characteristic: "How can linguistic compositionality be developed concomitantly with sensorimotor skills through associative learning, particularly when individuals only learn partial linguistic compositions and their corresponding sensorimotor patterns?" To address this question, we propose a brain-inspired neural network model that integrates vision, proprioception, and language into a framework of predictive coding and active inference, based on the free-energy principle. The effectiveness and capabilities of this model were assessed through various simulation experiments conducted with a robot arm. Our results show that generalization to unlearned verb-noun compositions is significantly enhanced when the variety of task compositions used in training is increased. We attribute this to self-organized compositional structures in the linguistic latent state space being influenced significantly by sensorimotor learning. Ablation studies show that visual attention and working memory are essential for accurately generating visuo-motor sequences that achieve linguistically represented goals. These insights advance our understanding of the mechanisms underlying the development of compositionality through interactions of linguistic and sensorimotor experience.
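
For orientation, the free-energy principle named in the abstract is commonly written as a variational free energy with a complexity term and an accuracy term. The form below is the generic textbook expression, not the paper's own loss: the symbols $x$ (sensory observation), $z$ (latent state), $q$ (approximate posterior), and $p$ (generative model) are assumptions introduced here for illustration.

$$
F \;=\; \underbrace{D_{\mathrm{KL}}\big[\,q(z \mid x)\,\|\,p(z)\,\big]}_{\text{complexity}} \;-\; \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{accuracy}}
$$

In predictive coding, minimizing $F$ with respect to $q$ reduces prediction error on observed inputs while keeping the inferred latent state close to the prior. In active inference, the same quantity is minimized with respect to a desired outcome, which is roughly how a goal-directed plan can be inferred.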

Paper Structure

This paper contains 25 sections, 28 equations, 17 figures, 8 tables, and 3 algorithms.

Figures (17)

  • Figure 1: (A) Model Architecture: Each modality generates visual, proprioceptive, or linguistic predictions. The Visual (conv-LSTM) and Proprioceptive (LSTM) modalities are integrated by the Associative PV-RNN, and the Linguistic LSTM is bound to the Associative PV-RNN via Parametric Bias (PB). Visual predictions are enhanced by two visual working memories (VWM-1 and VWM-2) and an attention mechanism, for which the parameters are generated by the Proprioceptive LSTM. For a given linguistic goal "put green on blue", (B) top: predicted visual sequence, bottom: observed ground truth; (C) top: joint angle trajectory predicted by the model compared with the ground truth, bottom: motor prediction error.
  • Figure 2: Goal-directed planning using active inference: the model generated the above visuo-proprioceptive sequence for the linguistically specified goal "put green on blue .". (A) masked representation of VWM-2; (B) VWM-1; (C) model prediction of attended visual stream; (D) final simulated prediction of the visual stream, where the red box indicates the attention coordinates predicted by the proprioceptive LSTM; (E) the ground truth target for comparison; (F) difference between the predicted visual stream and the ground truth target; (G) normalized joint angle trajectory predicted by the model compared with the corresponding ground truth; and (H) mean difference between the predicted joint angles and the ground truth.
  • Figure 3: Comparison of average visuo-proprioceptive error for inference of visuo-proprioceptive plans, between unlearned object positions (U-P) and unlearned compositions (U-C), across groups with different numbers of compositions at the highest training ratio of 80%.
  • Figure 4: Comparison of average visuo-proprioceptive error between groups with different training ratios for inference of visuo-proprioceptive plans to achieve unlearned compositions (U-C) of linguistically represented goals.
  • Figure 5: Scatterplot of mean Kernel PCA values of latent state $\mathbf{PB}$ vectors for all groups with the highest training ratio. (A) Group A1 (5x8, 80%); (B) Group B1 (5x6, 80%); (C) Group C1 (5x3, 80%); (D) Group D1 (3x3, 77%). U-C refers to unlearned compositions that were used for testing. Marker colors indicate the color of the object being manipulated. The variance explained by the two KPCA components was greater than 90% for all groups.
  • ...and 12 more figures
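
The group labels in the Figure 5 caption, such as "A1 (5x8, 80%)", can be read as a grid of compositions plus a training ratio. The sketch below makes that arithmetic concrete under that assumption: it treats "5x8" as a 5-by-8 grid (which axis is verbs and which is nouns is not stated here), rounds the ratio to a whole number of trained compositions, and holds out the rest as unlearned compositions (U-C). The function name `split_compositions` and the random selection are hypothetical; the paper's actual held-out compositions may have been chosen differently, for example so that every word still appears in training.

```python
import random
from itertools import product


def split_compositions(n_rows, n_cols, train_ratio, seed=0):
    """Split an n_rows x n_cols grid of compositions into (learned, unlearned).

    Hypothetical helper: the split is random, which does not guarantee that
    every row and column item appears at least once in the learned set.
    """
    grid = list(product(range(n_rows), range(n_cols)))
    n_train = round(train_ratio * len(grid))
    rng = random.Random(seed)
    rng.shuffle(grid)
    return grid[:n_train], grid[n_train:]


# Grid sizes and training ratios as listed in the Figure 5 caption.
groups = {
    "A1": (5, 8, 0.80),
    "B1": (5, 6, 0.80),
    "C1": (5, 3, 0.80),
    "D1": (3, 3, 0.77),
}

counts = {}
for name, (rows, cols, ratio) in groups.items():
    learned, unlearned = split_compositions(rows, cols, ratio)
    counts[name] = (len(learned), len(unlearned))
    print(f"{name}: {len(learned)} learned, {len(unlearned)} unlearned (U-C)")
```

Under this reading, the 77% listed for group D1 corresponds to 7 of 9 compositions (about 77.8%), since an exact 80% split is not possible on a 3x3 grid.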