Compositional Learning of Visually-Grounded Concepts Using Reinforcement

Zijun Lin, Haidi Azaman, M Ganesh Kumar, Cheston Tan

TL;DR

The results are the first to demonstrate that RL agents can be trained to implicitly learn concepts and compositionality, to solve more complex environments in zero-shot fashion.

Abstract

Children can rapidly generalize compositionally-constructed rules to unseen test sets. In contrast, deep reinforcement learning (RL) agents need to be trained over millions of episodes, and their ability to generalize to unseen combinations remains unclear. Hence, we investigate the compositional abilities of RL agents, using the task of navigating to specified color-shape targets in synthetic 3D environments. First, we show that when RL agents are naively trained to navigate to target color-shape combinations, they implicitly learn to decompose the combinations, allowing them to (re-)compose these and succeed at held-out test combinations ("compositional learning"). Second, when agents are pretrained to learn invariant shape and color concepts ("concept learning"), the number of episodes subsequently needed for compositional learning decreased by a factor of 20. Furthermore, only agents trained on both concept and compositional learning could solve a more complex, out-of-distribution environment in zero-shot fashion. Finally, we verified that only text encoders pretrained on image-text datasets (e.g. CLIP) reduced the number of training episodes needed for our agents to demonstrate compositional learning, and also generalized to 5 unseen colors in zero-shot fashion. Overall, our results are the first to demonstrate that RL agents can be trained to implicitly learn concepts and compositionality, to solve more complex environments in zero-shot fashion.
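The held-out test combinations described in the abstract can be illustrated with a minimal sketch of a compositional train/test split. The color and shape lists and the `compositional_split` helper below are hypothetical placeholders chosen for illustration, not the vocabulary or split used in the paper. The only property the sketch aims to show is that every color word and every shape word appears during training, while certain color-shape pairings are reserved for testing.

```python
from itertools import product

# Hypothetical attribute vocabularies (assumption: not the paper's actual lists).
COLORS = ["red", "green", "blue", "yellow"]
SHAPES = ["sphere", "cube", "cylinder", "capsule"]


def compositional_split(colors, shapes):
    """Hold out one combination per color so every word still appears in training."""
    all_pairs = list(product(colors, shapes))
    # Pair the i-th color with the i-th shape (wrapping around) for the test set.
    test = {(c, shapes[i % len(shapes)]) for i, c in enumerate(colors)}
    train = [p for p in all_pairs if p not in test]
    return train, sorted(test)


train_pairs, test_pairs = compositional_split(COLORS, SHAPES)

# Every color and every shape is seen in training, just never in a held-out pairing.
assert {c for c, _ in train_pairs} == set(COLORS)
assert {s for _, s in train_pairs} == set(SHAPES)

for color, shape in test_pairs:
    print(f"held-out instruction: {color} {shape}")
```

Under this kind of split, success on a held-out pairing cannot come from memorizing the full instruction; the agent has to recombine a color and a shape that it has only encountered separately.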

Paper Structure (28 sections, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Two example environments. Left column shows the top-view of the environments. Right column shows the RL agent's first-person view (128x128). Top and bottom rows show the C&S environment (target instruction is "red sphere") and the C&S&S environment (target instruction is "red sphere cylinder") respectively. Notice how the sizes and locations of the objects differ (compare top and bottom).
  • Figure 2: Left: Agent architecture. The language module of the one-hot encoder agent is at bottom right, boxed within dashed lines. Red arrows or boxes represent trainable weights, while black arrows or boxes represent frozen weights. Right: Learning curves of the RL agent in the train and test environments.
  • Figure 3: Embedding of average LSTM activity for 50 instructions from C&S&S after training only on C&S (Top, "No Concept Learning"), compared to pretraining on C|S and then on C&S (Bottom, "With Concept Learning").
  • Figure 4: Word embeddings of the agent with BERT (top) and CLIP (bottom) text encoders after training in the C&S environment for 50K episodes. Filled icons represent training set examples, while unfilled icons with magenta labels represent testing set examples.
  • Figure 5: The language model of the agent with A) a vanilla text encoder, B) a BERT text encoder, and C) a CLIP text encoder. The BERT and CLIP text encoders are both pretrained and frozen. Red arrows or boxes represent trainable weights, while black arrows or boxes represent frozen weights.
  • ...and 5 more figures