MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

Pu-Hai Yang, Heyan Huang, Heng-Da Xu, Fanshu Sun, Xian-Ling Mao, Chaoxu Mu

TL;DR

MMWOZ introduces a web GUI–driven multimodal task-oriented dialogue dataset derived from MultiWOZ 2.3 and a baseline model, MATE, that jointly reasons over dialogue history, GUI operation logs, and web-page snapshots to either manipulate the GUI or generate a response to the user. The dataset is collected automatically by converting dialogue annotations into GUI instructions and capturing web-page snapshots, enabling end-to-end multimodal interaction without back-end APIs. Experimental results reveal the importance of dialogue history and action logs, the benefits and trade-offs of including OCR/text versus image features, and the challenges of domain transfer and layout adaptation for GUI-based agents. The work provides insights into building practical multimodal task-oriented dialogue (TOD) systems that operate directly on GUIs, with implications for real-world deployment and future research on robust GUI manipulation and cross-domain generalization.

Abstract

Task-oriented dialogue systems have garnered significant attention due to their ability to accomplish goals through conversation, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that extends the MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.
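
As a rough illustration of the conversion step described in the abstract, the sketch below turns the change in dialogue state between two turns into a list of GUI operations. It is not the authors' script: the operation names (`input`, `click`), the element identifiers such as `restaurant-search`, and the helper functions are all hypothetical, chosen only to show the general idea of mapping newly filled slots to form actions.

```python
from typing import Dict, List, Tuple

# Hypothetical operation format: (action, target element id, value).
# The actual MMWOZ instruction set may differ.
Operation = Tuple[str, str, str]


def state_delta(prev: Dict[str, str], curr: Dict[str, str]) -> Dict[str, str]:
    """Return the slots whose values are new or changed in the current turn."""
    return {slot: value for slot, value in curr.items() if prev.get(slot) != value}


def to_gui_operations(
    domain: str, prev_state: Dict[str, str], curr_state: Dict[str, str]
) -> List[Operation]:
    """Map a dialogue-state update to GUI operations (illustrative only)."""
    ops: List[Operation] = []
    # Fill one form field per updated slot, in a deterministic order.
    for slot, value in sorted(state_delta(prev_state, curr_state).items()):
        ops.append(("input", f"{domain}-{slot}", value))
    # Assumed convention: submit the search form once any field has changed.
    if ops:
        ops.append(("click", f"{domain}-search", ""))
    return ops


prev = {"food": "indian"}
curr = {"food": "indian", "pricerange": "expensive", "area": "centre"}
operations = to_gui_operations("restaurant", prev, curr)
for op in operations:
    print(op)
```

In this sketch only slots that changed since the previous turn produce operations, so an unchanged state yields an empty list and the agent would be free to respond to the user instead of touching the GUI.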

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: An example of an agent interacting with a user in a traditional task-oriented dialogue system (From dialogue "MUL0001" in the MultiWOZ 2.3 dataset).
  • Figure 2: An example of using a web-style GUI to find and book a restaurant. The agent finds 6 expensive Indian restaurants in the town centre and eventually books the user a table for 6 people at saffron brasserie for 19:30 on Saturday (snapshot of the web page obtained by the agent after executing the GUI operation instructions in the last turn in Figure 1).
  • Figure 3: Pseudocode for collecting operation instructions and web page snapshots.
  • Figure 4: An example of how the data is organized in the MMWOZ dataset (From dialogue "SNG0073").
  • Figure 5: Distribution of operation types in different domains.
  • ...and 3 more figures