Multimodal Contextualized Semantic Parsing from Speech

Jordan Voas; Raymond Mooney; David Harwath

TL;DR

This work defines Semantic Parsing in Contextual Environments (SPICE), a multimodal framework in which an agent iteratively updates a structured knowledge graph $C_i$: it produces a formal parse $P_i = a(F_i^m, C_i)$ from the new multimodal input $F_i^m$ and the current context, then executes that parse to obtain $C_{i+1} = e(P_i, C_i)$. It introduces VG-SPICE, a large-scale dataset derived from Visual Genome that simulates spoken-dialogue-driven visual scene-graph construction, and AViD-SP, a baseline Audio-Vision Dialogue Scene Parser that fuses audio, vision, and prior context via the Grouped Modality Attention Down Sampler (GMADS) and a LoRA-tuned Llama 2. Evaluation uses Graph Edit Distance (GED) and Representation Edit Distance (RED), with hard and soft variants to accommodate isomorphism-based outputs, under varying noise conditions and with ablations on modality usage and prior context. The results show meaningful multimodal updates, with Soft-RED approaching 0.4, and highlight the importance of ASR quality and historical context. The work also acknowledges limitations related to synthetic data and the quality of Visual Genome, and leaves room for future extensions to video, 3D environments, and richer paralinguistics.
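The update rule in the TL;DR can be read as a simple loop over dialogue turns. The following is a minimal sketch of that loop, not the AViD-SP model: the triple-based context, the `Op` type, and the `agent`, `execute`, and `run_dialogue` names are hypothetical, and already-extracted triples stand in for the multimodal features $F_i^m$.

```python
from dataclasses import dataclass

# Assumption: a context C_i is modeled as a set of (subject, relation, object)
# triples. The source only says it is a structured knowledge graph.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Op:
    """One step of a formal parse P_i (hypothetical encoding)."""
    kind: str  # "add" or "remove"
    triple: Triple


def agent(features: list[Triple], context: frozenset[Triple]) -> list[Op]:
    """Toy stand-in for a(F_i^m, C_i): add every triple not yet in the context."""
    return [Op("add", t) for t in features if t not in context]


def execute(parse: list[Op], context: frozenset[Triple]) -> frozenset[Triple]:
    """Stand-in for e(P_i, C_i): apply the parse and return C_{i+1}."""
    updated = set(context)
    for op in parse:
        if op.kind == "add":
            updated.add(op.triple)
        elif op.kind == "remove":
            updated.discard(op.triple)
    return frozenset(updated)


def run_dialogue(turns: list[list[Triple]],
                 context: frozenset[Triple] = frozenset()) -> frozenset[Triple]:
    """Iterate P_i = a(F_i, C_i), C_{i+1} = e(P_i, C_i) over all turns."""
    for features in turns:
        parse = agent(features, context)
        context = execute(parse, context)
    return context
```

Separating `agent` from `execute` mirrors the two functions in the TL;DR: the parse is an explicit, inspectable object, and information already present in the context produces no new operations.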

Abstract

We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.

Paper Structure (29 sections, 4 equations, 3 figures, 5 tables)

Introduction
Related Work
Dialogue Systems and Multimodality
Semantic Parsing
Task Definition
Dataset
Challenge Subset
AViD-SP Model
Training Routine
Evaluation Metrics
Graph Edit Distance (GED)
Representation Edit Distance (RED)
Baselines and Evaluation
Results
Conclusion
...and 14 more sections

Figures (3)

Figure 1: Example of VG-SPICE inputs as well as a plausible output to produce the correct next state context. New information that the agent is expected to add to the context is shown in green while already known information is noted in red. Grounding entities that have new information being added to them are noted in blue and orange. The current context is shown as a textually prompted representation of the actual knowledge graph (discussed in the Contextual State Representation section).
Figure 2: a) The architecture of the AViD-SP model for VG-SPICE, integrating pretrained encoders and large language models (LLMs) with LoRA adapters and feature fusion modules. Trained and frozen segments of the model are denoted by fire and snowflake icons, respectively. b) Our novel Grouped Modality Attention Down Sampler module, enabling integrated cross-modality fusion and downsampling. Green modules share weights. For downsampling, we utilize mean pooling, and for upsampling we linearly interpolate the embeddings.
Figure 3: Sample generation output with corresponding inputs from AViD-SP. The output scored a Soft-RED of 0.0 and a Hard-RED of 6.727. Significant features are highlighted in color. Qualitative evaluation reveals that the majority of extraneous additions were supported by the audio transcription, the scene image, or both.
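The Figure 2 caption names two resampling operations: mean pooling for downsampling and linear interpolation for upsampling. The sketch below illustrates only those two operations on a sequence of embedding vectors. It is not the GMADS module, and the function names, the plain-list representation, and the endpoint-aligned interpolation are assumptions made for illustration.

```python
def mean_pool(seq, factor):
    """Downsample by averaging each window of `factor` consecutive vectors.

    A final window shorter than `factor` is averaged over the vectors it has.
    """
    pooled = []
    for start in range(0, len(seq), factor):
        window = seq[start:start + factor]
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


def linear_upsample(seq, target_len):
    """Upsample to `target_len` vectors by interpolating between neighbors.

    Assumption: the first and last output vectors coincide with the first and
    last input vectors (endpoint-aligned interpolation).
    """
    if target_len == 1 or len(seq) == 1:
        return [list(seq[0]) for _ in range(target_len)]
    scale = (len(seq) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale              # fractional position in the input sequence
        lo = int(pos)                # left neighbor index
        hi = min(lo + 1, len(seq) - 1)  # right neighbor index, clamped
        w = pos - lo                 # weight of the right neighbor
        out.append([(1 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])])
    return out
```

For example, pooling four one-dimensional frames `[[0.0], [2.0], [4.0], [6.0]]` with a factor of 2 yields two frames, and interpolating those two frames back to length 3 inserts their midpoint between them.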
