Multimodal Contextualized Semantic Parsing from Speech
Jordan Voas, Raymond Mooney, David Harwath
TL;DR
This work defines Semantic Parsing in Contextual Environments (SPICE), a multimodal framework in which an agent iteratively updates a structured knowledge graph $C_i$: it produces a formal parse $P_i = a(F_i^m, C_i)$ from the new input $F_i^m$ and the current context $C_i$, then executes that parse to obtain $C_{i+1} = e(P_i, C_i)$. It introduces VG-SPICE, a large-scale dataset derived from Visual Genome that simulates spoken-dialogue-driven visual scene-graph construction, and AViD-SP, a baseline Audio-Vision Dialogue Scene Parser that fuses audio, vision, and prior context via the Grouped Modality Attention Down Sampler (GMADS) and a LoRA-tuned Llama 2. Evaluation relies on Graph Edit Distance (GED) and Representation Edit Distance (RED), each with hard and soft variants to accommodate isomorphism-based outputs, under varying noise conditions and with ablations on modality usage and prior context. The results show meaningful multimodal updates, with Soft-RED approaching 0.4, and highlight the importance of ASR quality and historical context. The work also acknowledges limitations related to synthetic data and Visual Genome quality, and leaves room for future extensions to video, 3D environments, and richer paralinguistics.
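The update rule above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Context` class, the `("add_node", ...)` and `("add_edge", ...)` operation format, and the toy `parse` function, which takes already-extracted (subject, predicate, object) triples in place of real audio and visual features. In AViD-SP the role of `parse` is played by a learned model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical knowledge graph C_i: node names plus (subj, pred, obj) edges."""
    nodes: frozenset = frozenset()
    edges: frozenset = frozenset()


def parse(features, context):
    """Toy stand-in for P_i = a(F_i^m, C_i).

    `features` is a list of (subject, predicate, object) triples; only
    operations that add something missing from `context` are emitted.
    """
    ops = []
    for subj, pred, obj in features:
        for node in (subj, obj):
            if node not in context.nodes and ("add_node", node) not in ops:
                ops.append(("add_node", node))
        if (subj, pred, obj) not in context.edges:
            ops.append(("add_edge", (subj, pred, obj)))
    return ops


def execute(parse_ops, context):
    """Toy stand-in for C_{i+1} = e(P_i, C_i): apply the parse to the context."""
    nodes, edges = set(context.nodes), set(context.edges)
    for op, arg in parse_ops:
        if op == "add_node":
            nodes.add(arg)
        elif op == "add_edge":
            edges.add(arg)
    return Context(frozenset(nodes), frozenset(edges))


def run_dialogue(turns, context=Context()):
    """Iterate the parse-then-execute update over a sequence of turns."""
    for features in turns:
        context = execute(parse(features, context), context)
    return context


if __name__ == "__main__":
    turns = [[("dog", "on", "grass")], [("dog", "near", "tree")]]
    final = run_dialogue(turns)
    print(sorted(final.nodes))
    print(sorted(final.edges))
```

Because each parse is conditioned on the current context, the second turn only adds the `tree` node and the new edge; the existing `dog` node is reused rather than duplicated.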
Abstract
We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
