The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents
Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, Jason Weston
TL;DR
Open-domain dialogue agents require a broad skill set. This work introduces dodecaDialogue, a 12-task multi-task benchmark spanning text-only and image-grounded scenarios, and uses it to train a single multimodal transformer. It demonstrates that large-scale, dialogue-focused pretraining (especially on pushshift.io Reddit) combined with multi-task training yields state-of-the-art results on many subtasks, and it also analyzes grounding, decoding strategies, and zero-shot transfer. Human evaluations corroborate the engagement gains over prior baselines, making the model a strong reference point for future open-domain conversational systems.
Abstract
We introduce dodecaDialogue: a set of 12 tasks that measures if a conversational agent can communicate engagingly with personality and empathy, ask questions, answer questions by utilizing knowledge resources, discuss topics and situations, and perceive and converse about images. By multi-tasking on such a broad large-scale set of data, we hope to both move towards and measure progress in producing a single unified agent that can perceive, reason and converse with humans in an open-domain setting. We show that such multi-tasking improves over a BERT pre-trained baseline, largely due to multi-tasking with very large dialogue datasets in a similar domain, and that the multi-tasking in general provides gains to both text and image-based tasks using several metrics in both the fine-tune and task transfer settings. We obtain state-of-the-art results on many of the tasks, providing a strong baseline for this challenge.
