Human-like compositional learning of visually-grounded concepts using synthetic environments

Zijun Lin, M Ganesh Kumar, Cheston Tan

TL;DR

This paper tackles the problem of grounding complex linguistic concepts, specifically determiners and prepositions, to visual targets in a multi-modal reinforcement learning (RL) setting using synthetic 3D environments. It introduces three controllable environments (determiner, preposition, and combined determiner and preposition) and a curriculum-based training regime to analyze how agents learn, decompose, and recombine concepts to navigate to instructed targets. The key findings show that determiners can be grounded without a curriculum, while prepositions are harder to learn, a difficulty that curriculum learning substantially mitigates. Agents trained on determiner or preposition concepts generalize to held-out in-distribution (I.I.D.) instructions and, given appropriate pretraining and curriculum strategies, rapidly adapt to unseen combinations in out-of-distribution (O.O.D.) settings. The results highlight the potential of human-like learning curricula to improve learning efficiency and compositional generalization in multi-modal RL, with implications for aligning human-AI interaction in real-world referring-expression and navigation tasks.

Abstract

The compositional structure of language enables humans to decompose complex phrases and map them to novel visual concepts, showcasing flexible intelligence. While several algorithms exhibit compositionality, they fail to elucidate how humans learn to compose concept classes and ground visual cues through trial and error. To investigate this multi-modal learning challenge, we designed a 3D synthetic environment in which an agent learns, via reinforcement, to navigate to a target specified by a natural language instruction. These instructions comprise nouns, attributes, and critically, determiners, prepositions, or both. The vast array of word combinations heightens the compositional complexity of the visual grounding task, as navigating to a blue cube above red spheres is not rewarded when the instruction specifies navigating to "some blue cubes below the red sphere". We first demonstrate that reinforcement learning agents can ground determiner concepts to visual targets but struggle with more complex prepositional concepts. Second, we show that curriculum learning, a strategy humans employ, enhances concept learning efficiency, reducing the required training episodes by 15% in determiner environments and enabling agents to easily learn prepositional concepts. Finally, we establish that agents trained on determiner or prepositional concepts can decompose held-out test instructions and rapidly adapt their navigation policies to unseen visual object combinations. Leveraging synthetic environments, our findings demonstrate that multi-modal reinforcement learning agents can achieve compositional understanding of complex concept classes and highlight the efficacy of human-like learning strategies in improving artificial systems' learning efficiency.

Paper Structure

This paper contains 18 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Example environments, each with four target options. A reward of +10 is given when the agent navigates to the target matching the instruction. Punishments of -1, -3, and -10 are incurred for hitting a wall, reaching the wrong target, or failing to reach the correct target, respectively. A: Agent's view of the determiner ($D$) environment. B: Agent's view of the preposition ($P$) environment. C: Close-up view of the target object in $P$. D: Agent's view of the combined determiner and preposition ($D+P$) environment.
  • Figure 2: Left: Agent architecture. Red arrows or boxes represent trainable weights, while black arrows or boxes represent frozen weights. Right: Success rates of the agents with and without curriculum learning in $D$.
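
The reward scheme described in the Figure 1 caption can be written down as a small lookup table. The following is a minimal sketch, not the authors' implementation: the names `Outcome`, `REWARDS`, `step_reward`, and `episode_return` are hypothetical, and treating "failing to reach the correct target" as an end-of-episode timeout is an assumption, since the caption does not say when that penalty is applied.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical event labels for the outcomes named in the Figure 1 caption."""
    CORRECT_TARGET = "correct_target"  # agent reaches the instructed target
    WALL_HIT = "wall_hit"              # agent collides with a wall
    WRONG_TARGET = "wrong_target"      # agent reaches one of the other targets
    TIMEOUT = "timeout"                # assumed: episode ends without reaching the correct target


# Reward values taken from the Figure 1 caption.
REWARDS = {
    Outcome.CORRECT_TARGET: 10.0,
    Outcome.WALL_HIT: -1.0,
    Outcome.WRONG_TARGET: -3.0,
    Outcome.TIMEOUT: -10.0,
}


def step_reward(outcome: Outcome) -> float:
    """Return the scalar reward for a single outcome event."""
    return REWARDS[outcome]


def episode_return(outcomes: list[Outcome]) -> float:
    """Sum the (undiscounted) rewards over a sequence of outcome events."""
    return sum(step_reward(o) for o in outcomes)


if __name__ == "__main__":
    # Two wall collisions followed by reaching the instructed target.
    events = [Outcome.WALL_HIT, Outcome.WALL_HIT, Outcome.CORRECT_TARGET]
    print(episode_return(events))
```

Whether an episode ends on reaching a wrong target, and whether returns are discounted, is not stated in the caption, so the sketch leaves both out and simply sums event rewards.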