Grounded Language Learning Fast and Slow

Felix Hill, Olivier Tieleman, Tamara von Glehn, Nathaniel Wong, Hamza Merzic, Stephen Clark

TL;DR

The paper investigates fast-mapping in an embodied agent by pairing a two-phase Unity environment with multi-modal memory architectures. It demonstrates that agents with a dual-coding external memory (DCEM) or a TransformerXL-based core can achieve one-shot word-object binding and generalize to novel objects and categories, especially when augmented with an observation reconstruction loss and intrinsic motivation. The findings indicate that explicit external memory is more memory-efficient than purely transformer-based approaches, and that temporal context and meta-training across object counts are critical for generalization. By integrating fast and slow learning and validating the approach in a second environment, the work highlights the potential of memory-augmented RL to emulate core aspects of human language acquisition in interactive agents.

Abstract

Recent work has shown that large text-based neural language models, trained with conventional supervised learning objectives, acquire a surprising propensity for few- and one-shot learning. Here, we show that an embodied agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional reinforcement learning algorithms. After a single introduction to a novel object via continuous visual perception and a language prompt ("This is a dax"), the agent can re-identify the object and manipulate it as instructed ("Put the dax on the bed"). In doing so, it seamlessly integrates short-term, within-episode knowledge of the appropriate referent for the word "dax" with long-term lexical and motor knowledge acquired across episodes (i.e. "bed" and "putting"). We find that, under certain training conditions and with a particular memory writing mechanism, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful for later executing instructions. Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for 'fast-mapping', a fundamental pillar of human cognitive development and a potentially transformative capacity for agents that interact with human users.

Paper Structure

This paper contains 32 sections, 3 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Top: The two phases of a fast-mapping episode. Bottom: Screenshots of the task from the agent's perspective at important moments (including the contents of the language channel).
  • Figure 2: Accuracy on probe trials with varying numbers of total objects, for agents meta-trained with different numbers of total objects.
  • Figure 3: Accuracy during training and evaluation trials involving unfamiliar objects, for different sizes of global training set $G$. Curves show mean $\pm$ S.E. over 3 agent seeds in each condition.
  • Figure 4: Accuracy of agents in fast-mapping trials requiring the extension of ShapeNet categories from a single exemplar. Curves show the mean $\pm$ S.E. over three agent seeds in each condition.
  • Figure 5: Accuracy of agents trained without shaping reward on the 3-object fast-mapping task with $\vert G \vert = 30$. Curves show mean $\pm$ S.E. across three seeds in each condition.
  • ...and 5 more figures