Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Se Jin Park; Chae Won Kim; Hyeongseop Rha; Minsu Kim; Joanna Hong; Jeong Hun Yeo; Yong Man Ro

TL;DR

The paper proposes a Face-to-Face spoken dialogue model that takes audio-visual speech as input and generates audio-visual speech as the response, without relying on intermediate text. It introduces MultiDialog, described as the first large-scale audio-visual spoken dialogue corpus (340 hours, approximately 9,000 dialogues), and a joint speech-text pretraining pipeline that adapts a textually pretrained LLM to AV dialogue through AV speech tokenization. The reported experiments indicate strong semantic quality and high-quality AV generation, with robustness to acoustic noise, pointing to applications in avatar chatbots and multimodal synthesis. The dataset and a demo are publicly released to support further research in multimodal dialogue and talking-face synthesis.

Abstract

In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus, containing approximately 9,000 dialogues totaling 340 hours, recorded based on the open-domain dialogue dataset TopicalChat. MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script, together with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it to the audio-visual spoken dialogue domain through speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating face-to-face conversation. Demo and data are available at https://multidialog.github.io and https://huggingface.co/datasets/IVLLab/MultiDialog, respectively.

Paper Structure (29 sections, 3 equations, 5 figures, 7 tables)

Introduction
Related Work
Spoken Dialogue Dataset
Spoken Dialogue Models
MultiDialog Dataset
Preparation
Recording
Post-Processing
Audio-Visual Spoken Dialogue System
Audio-Visual Speech Encoding
Audio-Visual Spoken Dialogue Language Modeling
Audio-Visual Generation
Experimental Setup
Evaluation Metrics
Implementation Details
...and 14 more sections

Figures (5)

Figure 1: Overview of the proposed framework for multimodal spoken dialogue language modeling. Using AV speech tokens as pseudo-text, it can process an audio-visual face video from the user input and generate the corresponding response as an audio-visual face video.
Figure 2: Data constructed from the MultiDialog dataset for training the audio-visual spoken dialogue model. (a-c) are used for joint pretraining on audio-visual speech and text tokens, and (d) is used to fine-tune the model.
Figure 3: Evaluation prompt for multimodal dialogue language modeling. It is written as text for illustration, but the actual prompt is given as audio-visual input.
Figure 4: Audio-visual dialogue generation results of the proposed method, where the last turn is the generated audio-visual response. Note that three video frames are randomly sampled from each turn for illustration. (a-d) are conversations with four turns and (e-f) are conversations with two turns. The generated responses are in italics, with ASR transcriptions provided below.
Figure 5: Recording studio setup for MultiDialog dataset
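
The captions for Figures 1 and 2 describe AV speech tokens being treated as pseudo-text alongside ordinary text tokens. A minimal sketch of that idea follows, under stated assumptions: the `<av_N>` token naming, the collapsing of consecutive repeated units, the toy vocabulary, and the helper names `units_to_pseudo_text` and `extend_vocab` are all hypothetical illustrations and are not taken from the paper.

```python
from itertools import groupby


def units_to_pseudo_text(unit_ids, prefix="av"):
    """Map discrete AV speech unit IDs to pseudo-text token strings.

    Consecutive repeated units are collapsed first. This is an
    illustrative assumption, not a step confirmed by the source.
    """
    deduped = [uid for uid, _ in groupby(unit_ids)]
    return [f"<{prefix}_{uid}>" for uid in deduped]


def extend_vocab(text_vocab, num_units, prefix="av"):
    """Return a copy of the text vocabulary with one new entry per AV unit.

    New entries are appended after the existing text tokens so that the
    original text token IDs stay unchanged.
    """
    vocab = dict(text_vocab)
    for uid in range(num_units):
        vocab[f"<{prefix}_{uid}>"] = len(vocab)
    return vocab


# Toy text vocabulary (hypothetical), extended with 4 AV units.
text_vocab = {"<pad>": 0, "hello": 1, "world": 2}
vocab = extend_vocab(text_vocab, num_units=4)

# Hypothetical unit sequence from an AV speech tokenizer.
tokens = units_to_pseudo_text([3, 3, 1, 1, 1, 0, 3])
ids = [vocab[t] for t in tokens]

print(tokens)  # ['<av_3>', '<av_1>', '<av_0>', '<av_3>']
print(ids)     # [6, 4, 3, 6]
```

With a shared vocabulary of this kind, AV token sequences and text token sequences can be fed to the same language model, which is the general sense in which the tokens act as pseudo-text.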
