The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents

Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, Jason Weston

TL;DR

Open-domain dialogue agents require a broad skill set. This work introduces dodecaDialogue, a 12-task multi-task benchmark spanning text-only and image-grounded scenarios to train a single multimodal transformer. It demonstrates that large-scale, dialogue-focused pretraining (especially on pushshift.io Reddit) combined with multi-task training yields strong, state-of-the-art results on many subtasks, while also analyzing grounding, decoding strategies, and zero-shot transfer. Human evaluations corroborate engagement improvements over prior baselines, establishing a robust baseline for future open-domain conversational systems.

Abstract

We introduce dodecaDialogue: a set of 12 tasks that measures if a conversational agent can communicate engagingly with personality and empathy, ask questions, answer questions by utilizing knowledge resources, discuss topics and situations, and perceive and converse about images. By multi-tasking on such a broad large-scale set of data, we hope to both move towards and measure progress in producing a single unified agent that can perceive, reason and converse with humans in an open-domain setting. We show that such multi-tasking improves over a BERT pre-trained baseline, largely due to multi-tasking with very large dialogue datasets in a similar domain, and that the multi-tasking in general provides gains to both text and image-based tasks using several metrics in both the fine-tune and task transfer settings. We obtain state-of-the-art results on many of the tasks, providing a strong baseline for this challenge.

Paper Structure

This paper contains 43 sections, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Human evaluations on Image Chat and Wizard of Wikipedia (WoW), comparing existing state-of-the-art models with our All Tasks MT conversational agent. Engagingness win rates are statistically significant in all three matchups (binomial test, $p < .05$).