HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue

Sunjae Yoon; Dahyun Kim; Eunseop Yoon; Hee Suk Yoon; Junyeong Kim; Chang D. Yoo

TL;DR

The paper tackles the deaf response problem in video-grounded dialogue (VGD), in which systems ignore the audio when answering, by introducing HEAR, a framework that listens to audio selectively and strengthens the system's audio representations. HEAR combines two components. Sensible Audio Listening (SAL) decides when to attend to audio, using Keyword-based Audio Sensing and a Semantic Neural Estimator. Reconstructive Listening Enhancement (RLE) reconstructs masked audio from its surrounding context and constrains training with a Reconstruction Upper Bound. The method is model-agnostic and is validated on AVSD@DSTC7 and AVSD@DSTC8, where it reports state-of-the-art results and clear gains on audio-related questions; ablations indicate that SAL and RLE complement each other. Because HEAR attaches to existing VGD systems, it offers a way to improve their answers to audio-dependent questions without replacing the underlying model.
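As a rough illustration of the keyword-based sensing idea mentioned above, the sketch below flags a question as audio-related when one of its words begins with an audio keyword, and turns that flag into a binary gate. The keyword list and the names `AUDIO_KEYWORDS`, `is_audio_related`, and `audio_gate` are assumptions made for this example; they are not taken from the paper, which also pairs keyword matching with a learned Semantic Neural Estimator that is not modeled here.

```python
import re

# Hypothetical keyword list for illustration only; the paper's actual
# keywords are not given in this summary.
AUDIO_KEYWORDS = ("sound", "hear", "noise", "music", "audio", "voice", "talk", "speak")


def is_audio_related(question: str, keywords=AUDIO_KEYWORDS) -> bool:
    """Return True if any word in the question starts with an audio keyword."""
    words = re.findall(r"[a-z]+", question.lower())
    return any(word.startswith(key) for word in words for key in keywords)


def audio_gate(question: str) -> float:
    """Binary gate: 1.0 lets audio features through, 0.0 suppresses them."""
    return 1.0 if is_audio_related(question) else 0.0


if __name__ == "__main__":
    for q in ("Can you hear any music?", "What color is his shirt?"):
        print(q, audio_gate(q))
```

Prefix matching of this kind misses questions that concern audio without using any listed keyword, which is the sort of gap a semantic estimator would be expected to cover.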

Abstract

Video-grounded Dialogue (VGD) aims to answer questions regarding a given multi-modal input comprising video, audio, and dialogue history. Although there have been numerous efforts to develop VGD systems that improve response quality, existing systems are competent only at incorporating information from the video and text, and tend to struggle to extract the necessary information from the audio when generating appropriate responses to the question. The VGD system seems to be deaf; we therefore term this symptom, in which current systems ignore audio data, a deaf response. To overcome the deaf response problem, the Hearing Enhanced Audio Response (HEAR) framework is proposed to perform sensible listening by selectively attending to audio whenever the question requires it. The HEAR framework enhances the accuracy and audibility of VGD systems in a model-agnostic manner. HEAR is validated on VGD datasets (i.e., AVSD@DSTC7 and AVSD@DSTC8) and shows effectiveness with various VGD systems.

Paper Structure (30 sections, 11 equations, 13 figures, 7 tables)

Introduction
Related works
Video-grounded Dialogues
Task Definition
Hearing Enhanced Audio Response
Input representations
Dialogue Language Model
Sensible Audio Listening
Keyword-based Audio Sensing.
Semantic Neural Estimator.
Reconstructive Listening Enhancement
Audio Reconstruction.
Reconstruction Upper Bound.
Optimization and Inference
Experiments
...and 15 more sections

Figures (13)

Figure 1: Current VGD systems' deaf responses to questions about audio: (a) Audio is considered not present and (b) Audio is disregarded as background noise.
Figure 2: Current VGD systems' performance on the AVSD dataset (validation): (a) Response performance when training with and without audio, (b) Average performance drop on questions about audio.
Figure 3: Illustration of Hearing Enhanced Audio Response Framework (HEAR) for video-grounded dialogue. HEAR performs sensible listening via (a) Sensible Audio Listening that selectively attends to audio corresponding to a given question and improves audibility via (b) Reconstructive Listening Enhancement that enhances audio representations by establishing a reconstruction upper bound to connect audio with its surrounding information.
Figure 4: Examples of questions: (a) Predicted audio-related questions by keyword-based audio sensing and (b) Outliers of keyword-based audio sensing.
Figure 5: Illustrations of surrounding masking. The distance $n$ decides the extent of surrounding masking.
...and 8 more figures
