A BiRGAT Model for Multi-intent Spoken Language Understanding with Hierarchical Semantic Frames

Hongshen Xu, Ruisheng Cao, Su Zhu, Sheng Jiang, Hanchong Zhang, Lu Chen, Kai Yu

TL;DR

This work tackles multi-intent spoken language understanding by introducing the MIVS dataset, which uses a 3-layer hierarchical semantic frame (domain → intent → slot) to represent complex user utterances in in-vehicle contexts. It proposes BiRGAT, a dual Relational Graph Attention Network-based encoder that jointly encodes question words and ontology items, coupled with a 3-way pointer-generator decoder that can generate, copy from the question, or select ontology items to form the output tree. Comprehensive experiments on MIVS and TOPv2 show BiRGAT outperforms traditional sequence labeling and classification approaches, with ablations confirming the value of ontology-aware embeddings, hierarchical relations, and cross-attention. The results demonstrate strong performance gains in multi-domain, multi-intent SLU and offer insights into few-shot transfer and limitations of large language models for structured semantic output in low-resource settings.
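To make the domain → intent → slot hierarchy concrete, the following is a minimal sketch of how such a 3-layer frame could be held in memory and flattened. The class names, the `flatten` helper, and the example labels (`vehicle_control`, `navigation`, and so on) are illustrative assumptions, not the actual MIVS schema or annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Slot:
    name: str
    value: str


@dataclass
class Intent:
    name: str
    slots: List[Slot] = field(default_factory=list)


@dataclass
class Domain:
    name: str
    intents: List[Intent] = field(default_factory=list)


def flatten(frame: List[Domain]) -> List[Tuple[str, str, str, str]]:
    """Expand the tree into (domain, intent, slot, value) tuples."""
    rows = []
    for domain in frame:
        for intent in domain.intents:
            for slot in intent.slots:
                rows.append((domain.name, intent.name, slot.name, slot.value))
    return rows


# Hypothetical utterance: "open the window and navigate to the airport".
# All labels below are made up for illustration.
frame = [
    Domain("vehicle_control", [Intent("open", [Slot("device", "window")])]),
    Domain("navigation", [Intent("navigate", [Slot("destination", "airport")])]),
]
rows = flatten(frame)
```

Because every slot hangs under exactly one intent, and every intent under exactly one domain, each slot value stays attached to the intent it belongs to even when one utterance carries several intents.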

Abstract

Previous work on spoken language understanding (SLU) mainly focuses on single-intent settings, where each input utterance merely contains one user intent. This configuration significantly limits the surface form of user utterances and the capacity of output semantics. In this work, we first propose a Multi-Intent dataset which is collected from a realistic in-Vehicle dialogue System, called MIVS. The target semantic frame is organized in a 3-layer hierarchical structure to tackle the alignment and assignment problems in multi-intent cases. Accordingly, we devise a BiRGAT model to encode the hierarchy of ontology items, the backbone of which is a dual relational graph attention network. Coupled with the 3-way pointer-generator decoder, our method outperforms traditional sequence labeling and classification-based schemes by a large margin.
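The three actions of the decoder (generate a token, copy a question word, or select an ontology item) can be pictured as a gated mixture over three distributions. The sketch below is a generic pointer-generator-style step under that assumption. The function `three_way_step`, its arguments, and the softmax gate are hypothetical and are not taken from the paper's equations.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def three_way_step(
    gate_logits: List[float],   # 3 scores: generate, copy, select (assumed gate)
    vocab_scores: List[float],  # one score per vocabulary token
    copy_scores: List[float],   # one score per question position
    onto_scores: List[float],   # one score per ontology item
    vocab: List[str],
    question: List[str],
    ontology: List[str],
) -> Dict[Tuple[str, str], float]:
    """Mix three distributions into one distribution over (action, item)."""
    p_gen, p_copy, p_sel = softmax(gate_logits)
    out: Dict[Tuple[str, str], float] = {}
    for tok, p in zip(vocab, softmax(vocab_scores)):
        out[("gen", tok)] = p_gen * p
    for word, p in zip(question, softmax(copy_scores)):
        # A word repeated in the question accumulates its copy mass.
        key = ("copy", word)
        out[key] = out.get(key, 0.0) + p_copy * p
    for item, p in zip(ontology, softmax(onto_scores)):
        out[("select", item)] = p_sel * p
    return out


# Toy inputs, made up for illustration.
dist = three_way_step(
    gate_logits=[0.0, 0.0, 5.0],
    vocab_scores=[1.0, 0.0],
    copy_scores=[0.0, 2.0, 0.0],
    onto_scores=[3.0, 0.0],
    vocab=["(", ")"],
    question=["open", "the", "window"],
    ontology=["navigation", "vehicle_control"],
)
best = max(dist, key=dist.get)
```

With these toy scores the gate strongly favors selection, so the most probable action is to select the first ontology item.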

Paper Structure

This paper contains 16 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A multi-intent example from the MIVS dataset.
  • Figure 2: An overview of the BiRGAT model architecture.
  • Figure 3: Few-shot learning experiments when transferring to more intents ($>3$) in the "in-vehicle control" domain. Due to the maximum token limit, LLM prompts are truncated to at most $10$ exemplars.