Interactive Grounded Language Acquisition and Generalization in a 2D World

Haonan Yu, Haichao Zhang, Wei Xu

TL;DR

This work tackles interactive language learning with grounding in a 2D world by proposing explicit grounding, where words act as detectors tied to environment concepts via a shared concept-detection function for both grounding and language prediction. The model disentangles language grounding from downstream perception and control, enabling robust zero-shot generalization across new word combinations ($ZS1$) and new words learned from QA ($ZS2$). Trained on a dataset of over 1.6 million sentences in the xworld environment, it achieves strong performance against baselines and provides interpretable intermediate representations, with promising indications for extension to 3D worlds. Overall, the approach advances in-context word grounding and transfer, offering a scalable path for interactive language acquisition in embodied agents.

Abstract

We build a virtual agent for learning language in a 2D maze-like world. The agent sees images of the surrounding environment, listens to a virtual teacher, and takes actions to receive rewards. It interactively learns the teacher's language from scratch based on two language use cases: sentence-directed navigation and question answering. It learns simultaneously the visual representations of the world, the language, and the action control. By disentangling language grounding from other computational routines and sharing a concept detection function between language grounding and prediction, the agent reliably interpolates and extrapolates to interpret sentences that contain new word combinations or new words missing from training sentences. The new words are transferred from the answers of language prediction. Such a language ability is trained and evaluated on a population of over 1.6 million distinct sentences consisting of 119 object words, 8 color words, 9 spatial-relation words, and 50 grammatical words. The proposed model significantly outperforms five comparison methods for interpreting zero-shot sentences. In addition, we demonstrate human-interpretable intermediate outputs of the model in the appendix.

Paper Structure

This paper contains 23 sections, 13 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: An illustration of xworld and the two language use cases. (a) and (b): A mixed training of NAV and QA. (c): Testing ZS1 sentences contain a new combination of words ("east" and "avocado") that never appear together in any training sentence. (d): Testing ZS2 sentences contain a new word ("watermelon") that never appears in any training sentence but is learned from a training answer. This figure is only a conceptual illustration of language generalization; in practice it might take many training sessions before the agent can generalize. (Due to space limitations, the maps are only partially shown.)
  • Figure 2: An overview of the model. We process $e$ by always placing the agent at the center via zero padding. This helps the agent learn navigation actions by reducing the variety of target representations. $c$, $a$, and $v$ are the predicted answer, the navigation action, and the critic value for policy gradient, respectively. $\phi$ denotes the concept detection function shared by language grounding and prediction. $\mathbf{M}_A$ generates a compact representation from $x_{\text{loc}}$ and $h$ for navigation (see the appendix for details).
  • Figure 3: An illustration of the attention cube $x_{\text{cube}}=x_{\text{loc}}\cdot{x_{\text{feat}}}^{\intercal}$, where $x_{\text{loc}}$ attends to image regions and $x_{\text{feat}}$ selects feature maps. In this example, $x_{\text{loc}}$ is computed from "northeast." In order for the agent to correctly answer "red" (color) instead of "watermelon" (object name), $x_{\text{feat}}$ has to be computed from the sentence pattern "What ... color ...?"
  • Figure 4: A symbolic example of the 2D convolution for transforming attention maps. A 2D convolution can be decomposed into two steps: flipping and cross correlation. The attention map of "northwest" is treated as an offset filter to translate that of "apple." Note that in practice, the attention is continuous and noisy, and the interpreter has to learn to find out the words (if any) to perform this convolution.
  • Figure 5: The three types of language data and their statistics.
  • ...and 13 more figures
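
The two operations named in the captions of Figures 3 and 4 can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the 5x5 map, the 3x3 offset filter, the one-hot attention values, and all function names are assumptions chosen for illustration. It shows (i) a 2D convolution computed as a kernel flip followed by cross-correlation, which moves the attention on "apple" one cell to the northwest, and (ii) the attention cube formed as the outer product $x_{\text{loc}}\cdot{x_{\text{feat}}}^{\intercal}$.

```python
# Illustrative sketch only: map sizes, one-hot attention values, and all
# function names are assumptions, not taken from the paper's code.

def cross_correlate_same(image, kernel):
    """Same-size 2D cross-correlation with zero padding (odd-sized kernel)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2  # kernel center
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for u in range(kh):
                for v in range(kw):
                    y, x = i + u - ch, j + v - cw
                    if 0 <= y < h and 0 <= x < w:  # zero padding outside the map
                        total += image[y][x] * kernel[u][v]
            out[i][j] = total
    return out


def convolve_same(image, kernel):
    """2D convolution: flip the kernel on both axes, then cross-correlate."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate_same(image, flipped)


def attention_cube(x_loc, x_feat):
    """Outer product: cube[i][j] = x_loc[i] * x_feat[j]."""
    return [[loc * feat for feat in x_feat] for loc in x_loc]


# Hypothetical attention map for "apple": all mass on row 2, column 3.
apple = [[0.0] * 5 for _ in range(5)]
apple[2][3] = 1.0

# Hypothetical offset filter for "northwest": mass on the cell up-left of center.
northwest = [[1.0, 0.0, 0.0],
             [0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0]]

# The attention moves to the cell northwest of the apple.
shifted = convolve_same(apple, northwest)

# Flatten the spatial attention and pair it with a feature-selection vector
# (here three hypothetical feature maps, with the second one selected).
x_loc = [value for row in shifted for value in row]
x_feat = [0.0, 1.0, 0.0]
cube = attention_cube(x_loc, x_feat)
```

With these inputs, `shifted` has its only nonzero entry at row 1, column 2, and `cube` has one row per map cell and one column per feature map. As the Figure 4 caption notes, real attention maps are continuous and noisy, so the clean one-hot translation here is an idealization.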