Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence

Emilio Jorge, Mikael Kågebäck, Fredrik D. Johansson, Emil Gustavsson

TL;DR

This work investigates emergent grounded language by training two agents to play a collaborative Guess Who? game using Deep Recurrent Q-Networks and differentiable inter-agent learning. The agents are trained end-to-end with separate parameters and develop a discrete, grounded vocabulary and a multi-step dialogue that references visual concepts; a noise curriculum promotes robust language grounding. Experiments on Guess Who? and CelebA images show that larger vocabularies and memory-enabled interaction improve performance, and that the learned language aligns with visual attributes while supporting context-dependent meaning. The findings indicate that grounded language can emerge in interactive, visually grounded environments, and that the resulting communication is scalable and interpretable without a pre-defined protocol.

Abstract

Acquiring your first language is an incredible feat and not easily duplicated. Learning to communicate using nothing but a few pictureless books (a corpus) would likely be impossible even for humans. Nevertheless, this is the dominant approach in most natural language processing today. As an alternative, we propose the use of situated interactions between agents as a driving force for communication, and the framework of Deep Recurrent Q-Networks for evolving a shared language grounded in the provided environment. We task the agents with interactive image search in the form of the game Guess Who?. The images from the game provide a nontrivial environment for the agents to discuss and a natural grounding for the concepts they decide to encode in their communication. Our experiments show that the agents learn not only to encode physical concepts in their words (i.e., grounding) but also to hold a multi-step dialogue, remembering the state of the dialogue from step to step.

Paper Structure

This paper contains 16 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Schematic illustration of our version of the Guess Who? game.
  • Figure 2: Architecture of the model. The time dimension (question-answer rounds in the game) of the RNN goes from top to bottom. The green boxes illustrate the internal state of the network.
  • Figure 3: Performance of the model on Guess Who? images when the asking-agent has two images, a vocabulary of two, four, eight or sixteen different words, and one round of question-answer is performed. The results are averaged over five runs. The dashed grey lines represent the baseline performance where the asking-agent guesses randomly. The performance of the model is ordered from the highest score with the largest vocabulary to the lowest score with the smallest vocabulary.
  • Figure 4: Performance of the model on Guess Who? images when the asking-agent has four images, a vocabulary of two, four, or eight different words, and two rounds of question-answer are performed. The results are averaged over five runs. The dashed grey lines represent the baseline performance where the asking-agent guesses randomly. The performance of the model is ordered from the highest score with the largest vocabulary to the lowest score with the smallest vocabulary.
  • Figure 5: Performance of the model on images from the CelebA dataset when the asking-agent has four images, two rounds of question-answer are performed, and vocabularies of eight and sixteen words are available. The dashed grey lines represent the baseline performance where the asking-agent guesses randomly.
  • ...and 4 more figures
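
The figure captions above compare the model against a baseline in which the asking-agent guesses randomly. As a rough illustration of what that baseline amounts to, the sketch below estimates how often a uniformly random guess is correct. It assumes that performance is measured as the fraction of correct guesses and that the target image is always one of the asking-agent's images; the captions do not state either point, and the function names are hypothetical.

```python
import random


def expected_random_accuracy(num_images):
    """Closed-form chance of a uniformly random guess being correct."""
    return 1.0 / num_images


def simulated_random_accuracy(num_images, num_games=10_000, seed=0):
    """Monte Carlo estimate of the same quantity.

    Assumption (not from the source): the target is drawn uniformly from
    the asking-agent's images and the guess ignores the dialogue entirely.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_games):
        target = rng.randrange(num_images)  # image held by the answering agent
        guess = rng.randrange(num_images)   # random guess by the asking agent
        correct += int(guess == target)
    return correct / num_games


if __name__ == "__main__":
    for k in (2, 4):
        print(k, expected_random_accuracy(k), simulated_random_accuracy(k))
```

Under these assumptions the baseline would be 1/2 for the two-image setting of Figure 3 and 1/4 for the four-image settings of Figures 4 and 5. If the plots report reward rather than accuracy, the baseline value would differ.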