Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding

Di Wu, Liting Jiang, Ruiyu Fang, Bianjing, Hongyan Xie, Haoxiang Su, Hao Huang, Zhongjiang He, Shuangyong Song, Xuelong Li

TL;DR

VRSLU is a spoken language understanding (SLU) benchmark that combines visual context with explicit reasoning to resolve ambiguous user utterances. Its core method, LR-Instruct, prompts models to predict labels first and generate reasoning second, which reduces the risk that biased reasoning degrades label accuracy. Experiments across multiple multimodal LLMs show that visual context awareness (CA) and explicit reasoning improve intent detection, slot filling, and overall semantic frame accuracy, while also enhancing interpretability. The work demonstrates the practical value of integrating visual context and reasoning in SLU, with strong potential for deployment in real-world task-oriented dialogue (TOD) systems.
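The three metrics named above can be illustrated with a minimal sketch. It assumes the conventional SLU definition of overall (semantic frame) accuracy, under which an utterance counts as correct only when its intent and every slot label match the gold annotation; the paper's exact evaluation code is not shown in this summary, and the function names and toy labels below are hypothetical.

```python
from typing import Sequence


def intent_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of utterances whose predicted intent equals the gold intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def overall_accuracy(
    gold_intents: Sequence[str],
    pred_intents: Sequence[str],
    gold_slots: Sequence[Sequence[str]],
    pred_slots: Sequence[Sequence[str]],
) -> float:
    """Fraction of utterances with a correct intent AND a fully correct slot sequence.

    Assumption: this is the conventional "semantic frame" accuracy used in SLU work.
    """
    correct = 0
    for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_slots, pred_slots):
        if gi == pi and list(gs) == list(ps):
            correct += 1
    return correct / len(gold_intents)


# Toy data with hypothetical label names.
gold_i = ["PlayMusic", "PlayVideo", "PlayMusic"]
pred_i = ["PlayMusic", "PlayMusic", "PlayMusic"]
gold_s = [["O", "B-song"], ["O", "B-video"], ["B-artist", "O"]]
pred_s = [["O", "B-song"], ["O", "B-video"], ["O", "O"]]

intent_acc = intent_accuracy(gold_i, pred_i)
overall_acc = overall_accuracy(gold_i, pred_i, gold_s, pred_s)
```

In this toy example only the first utterance has both a correct intent and a correct slot sequence, which is why overall accuracy is stricter than either sub-task metric on its own.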

Abstract

Spoken Language Understanding (SLU) consists of two sub-tasks: intent detection (ID) and slot filling (SF). Given its broad range of real-world applications, enhancing SLU for practical deployment is increasingly critical. Profile-based SLU addresses ambiguous user utterances by incorporating context awareness (CA), user profiles (UP), and knowledge graphs (KG) to support disambiguation, thereby advancing SLU research toward real-world applicability. However, existing SLU datasets still fall short in representing real-world scenarios. Specifically, (1) CA is represented with one-hot vectors, which is overly idealized, and (2) models typically focus solely on predicting intent and slot labels, neglecting the reasoning process that could enhance performance and interpretability. To overcome these limitations, we introduce VRSLU, a novel SLU dataset that integrates both Visual images and explicit Reasoning. To replace the over-idealized CA, we use GPT-4o and FLUX.1-dev to generate images reflecting users' environments and statuses, followed by human verification to ensure quality. For reasoning, GPT-4o is employed to generate explanations for predicted labels, which are then refined by human annotators to ensure accuracy and coherence. Additionally, we propose an instructional template, LR-Instruct, which first predicts labels and then generates the corresponding reasoning. This two-step approach helps mitigate the influence of reasoning bias on label prediction. Experimental results confirm the effectiveness of incorporating visual information and highlight the promise of explicit reasoning in advancing SLU.
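The label-first ordering behind LR-Instruct can be sketched as a prompt builder and a response parser. This is only an illustration under stated assumptions: the actual LR-Instruct template is not reproduced in this summary, so the wording, the `Intent:` / `Slots:` / `Reasoning:` output format, and the function names are all hypothetical, and the scene image that a multimodal model would also receive is left out.

```python
import re
from typing import Dict, List

# Hypothetical output format: labels come first, reasoning comes last.
_RESPONSE_RE = re.compile(
    r"Intent:\s*(?P<intent>.*?)\s*\n"
    r"Slots:\s*(?P<slots>.*?)\s*\n"
    r"Reasoning:\s*(?P<reasoning>.*)",
    re.DOTALL,
)


def build_label_first_prompt(
    utterance: str, intents: List[str], slot_names: List[str]
) -> str:
    """Build a prompt asking for labels before the explanation (illustrative wording)."""
    return (
        "You are given a user utterance and a scene image.\n"
        f"Utterance: {utterance}\n"
        f"Candidate intents: {', '.join(intents)}\n"
        f"Candidate slots: {', '.join(slot_names)}\n"
        "First output the labels, then explain them, in exactly this format:\n"
        "Intent: <intent>\n"
        "Slots: <slot>=<value>; <slot>=<value>\n"
        "Reasoning: <explanation of the labels above>\n"
    )


def parse_label_first_response(text: str) -> Dict[str, object]:
    """Split a model response into intent, slot dictionary, and reasoning text."""
    match = _RESPONSE_RE.search(text)
    if match is None:
        raise ValueError("response does not follow the label-first format")
    slots: Dict[str, str] = {}
    for piece in match.group("slots").split(";"):
        if "=" in piece:
            name, value = piece.split("=", 1)
            slots[name.strip()] = value.strip()
    return {
        "intent": match.group("intent").strip(),
        "slots": slots,
        "reasoning": match.group("reasoning").strip(),
    }


prompt = build_label_first_prompt(
    "Play Hello by Adele", ["PlayMusic", "PlayVideo"], ["songName", "artist"]
)
response = (
    "Intent: PlayMusic\n"
    "Slots: songName=Hello; artist=Adele\n"
    "Reasoning: The user is driving, so audio is preferred."
)
parsed = parse_label_first_response(response)
```

Because the labels are emitted before the reasoning, an autoregressive model commits to them without conditioning on its own explanation, which is the intuition the abstract gives for limiting the effect of reasoning bias on label prediction.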

Paper Structure

This paper contains 23 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The construction process of scene images and reasoning in VRSLU: scene images (b) are generated from the one-hot vectors (a) in CA, while the reasoning is constructed by integrating the utterance, CA, UP, KG, intent, and slot labels (c).
  • Figure 2: Overall accuracy of General SLU and ProHAN under the w/ CA and w/o CA settings.
  • Figure 3: The process of image construction: (a) shows the prompt, (b) the CA one-hot vectors, (c) the corresponding descriptive paragraph, (d) and (e) invalid images, highlighted in red, and (f) a valid image.
  • Figure 4: The process of reasoning construction: (a) is the prompt, (b) is an example (information irrelevant to the user's request has been omitted for clarity), (c) is the original reasoning generated by GPT-4o, with errors highlighted in red, and (d) is the manually corrected result.
  • Figure 5: The proposed LR-Instruct for VRSLU. Due to space limitations, we omit the case details as well as the candidate lists for intent and slot labels.
  • ...and 2 more figures