Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model

Xiaolin Chen; Xuemeng Song; Liqiang Jing; Shuo Li; Linmei Hu; Liqiang Nie

TL;DR

This paper tackles text response generation in multimodal task-oriented dialog systems by integrating a generative pretrained language model (GPLM) with dual knowledge selection and cross-modal context learning. The proposed DKMD framework comprises three components: dual knowledge selection (textual and visual), dual knowledge-enhanced context learning (global textual and local visual representations with cross-modal refinement), and a knowledge-enhanced decoder that adds a dot-product knowledge-decoder attention sub-layer to use the selected knowledge explicitly during generation. Experiments on the public MMConv dataset show that DKMD outperforms strong baselines on BLEU and NIST, and ablations indicate that both knowledge sources and the dual refinement mechanisms contribute to the result. The work highlights the value of combining GPLMs with structured multimodal knowledge to improve the relevance and factual grounding of dialog responses; the code is released.

Abstract

Text response generation for multimodal task-oriented dialog systems, which aims to generate a proper text response given the multimodal context, is an essential yet challenging task. Although existing efforts have achieved compelling success, they still suffer from two pivotal limitations: 1) they overlook the benefit of generative pre-training, and 2) they ignore knowledge related to the textual context. To address these limitations, we propose a novel dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems (DKMD), consisting of three key components: dual knowledge selection, dual knowledge-enhanced context learning, and knowledge-enhanced response generation. Specifically, the dual knowledge selection component selects the related knowledge according to both the textual and visual modalities of the given context. Thereafter, the dual knowledge-enhanced context learning component aims to seamlessly integrate the selected knowledge into multimodal context learning from both global and local perspectives, where the cross-modal semantic relation is also explored. Moreover, the knowledge-enhanced response generation component comprises a revised BART decoder, in which an additional dot-product knowledge-decoder attention sub-layer is introduced to explicitly utilize the knowledge and advance text response generation. Extensive experiments on a public dataset verify the superiority of the proposed DKMD over state-of-the-art competitors.
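The dot-product knowledge-decoder attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `knowledge_attention`, the tensor shapes, the optional scaling by the square root of the hidden size, and the residual connection are all assumptions borrowed from standard Transformer sub-layers, and the learned projection matrices a real decoder layer would have are omitted. It only shows the general idea of decoder states attending over a set of selected knowledge representations.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_attention(decoder_states, knowledge_states, scale=True):
    """Hypothetical dot-product attention from decoder states to knowledge.

    decoder_states:   (T, d) array, one row per decoded token so far.
    knowledge_states: (K, d) array, one row per selected knowledge entry.
    Returns the knowledge-enhanced states (T, d) and the weights (T, K).
    """
    d = decoder_states.shape[-1]
    # Dot-product similarity between each decoder state and each knowledge entry.
    scores = decoder_states @ knowledge_states.T
    if scale:
        # Assumption: scale as in standard scaled dot-product attention.
        scores = scores / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Weighted sum of knowledge entries for every decoder position.
    attended = weights @ knowledge_states
    # Assumption: residual connection, as in other Transformer sub-layers.
    return decoder_states + attended, weights

# Toy inputs: 4 decoder positions, 3 knowledge entries, hidden size 8.
rng = np.random.default_rng(0)
dec = rng.normal(size=(4, 8))
kn = rng.normal(size=(3, 8))
out, w = knowledge_attention(dec, kn)
```

In this sketch each row of `w` is a distribution over the knowledge entries, so every decoding position receives its own mixture of knowledge before the next sub-layer.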

Paper Structure (20 sections, 14 equations, 9 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Task-oriented Dialog Systems
Pretrained Language Models
Preliminary
Model
Problem Formulation
Dual Knowledge Selection
Dual Knowledge-enhanced Context Learning
Knowledge-enhanced Context Representation
Dual Cross-modal Representation Refinement
Knowledge-enhanced Response Generation
Experiment
Dataset
Experiment Setting
...and 5 more sections

Figures (9)

Figure 1: Illustration of a multimodal dialog system between a user and an agent. "u": utterance.
Figure 2: Illustration of the proposed DKMD, which consists of three vital components: (a) Dual Knowledge Selection, (b) Dual Knowledge-enhanced Context Learning, and (c) Knowledge-enhanced Response Generation. 'KB': Knowledge base. 'K': Knowledge.
Figure 3: Schematic illustration of the original BART decoder and the revised BART decoder.
Figure 4: Workflow of the proposed DKMD. "T" and "V" denote Textual and Visual, respectively.
Figure 5: Convergence analysis of our proposed DKMD.
...and 4 more figures
