TECO: Improving Multimodal Intent Recognition with Text Enhancement through Commonsense Knowledge Extraction

Quynh-Mai Thi Nguyen, Lan-Nhi Thi Nguyen, Cam-Van Thi Nguyen

TL;DR

TECO addresses multimodal intent recognition by enriching textual representations with commonsense knowledge drawn from two sources: relations generated by COMET and relations retrieved with SBERT, restricted to the xReact and xWant relation types. It introduces a Textual Enhancement Module (TEM) that fuses these dual-perspective relation features and a Multimodal Alignment Fusion (MAF) module that aligns the enhanced text with the visual and acoustic modalities. On the MIntRec dataset, TECO outperforms strong baselines in both multi-class and binary classification settings, and ablations confirm that each component contributes, particularly TEM and MAF. The results indicate that injecting structured commonsense knowledge into the text modality and carefully aligning the modalities improves intent recognition, with practical implications for more context-aware dialogue systems and multimodal understanding.

Abstract

The objective of multimodal intent recognition (MIR) is to leverage various modalities, such as text, video, and audio, to detect user intentions, which is crucial for understanding human language and context in dialogue systems. Despite advances in this field, two main challenges persist: (1) extracting and utilizing semantic information from robust textual features; (2) aligning and fusing non-verbal modalities with verbal ones. This paper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO) to address both challenges. We begin by extracting relations from both generated and retrieved knowledge to enrich the contextual information in the text modality. We then align and integrate visual and acoustic representations with these enhanced text features to form a cohesive multimodal representation. Our experimental results show substantial improvements over existing baseline methods.
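To make the "retrieved knowledge" step more concrete, the following is a minimal sketch of nearest-neighbour retrieval by cosine similarity. It is not the authors' implementation: the function names, the knowledge-base layout, and the hand-made 3-dimensional vectors (standing in for sentence embeddings such as those SBERT would produce) are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_relation(query_vec, knowledge_base, relation):
    """Return the `relation` text of the entry whose head embedding is closest to the query."""
    best = max(knowledge_base, key=lambda entry: cosine(query_vec, entry["head_vec"]))
    return best[relation]


# Hypothetical toy knowledge base: each entry pairs an embedding of a head event
# with xReact / xWant tails. Vectors and texts are invented for this example.
KNOWLEDGE_BASE = [
    {"head_vec": [1.0, 0.0, 0.0], "xReact": "grateful", "xWant": "to say thanks"},
    {"head_vec": [0.0, 1.0, 0.0], "xReact": "annoyed", "xWant": "to complain"},
]

# Hypothetical embedding of the input utterance.
utterance_vec = [0.9, 0.1, 0.0]

print(retrieve_relation(utterance_vec, KNOWLEDGE_BASE, "xReact"))
print(retrieve_relation(utterance_vec, KNOWLEDGE_BASE, "xWant"))
```

In a real pipeline the vectors would come from a sentence encoder, and the retrieved relation text would then be encoded and combined with the generated relation before being fused into the text features.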

Paper Structure

This paper contains 20 sections, 15 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An example of integrating commonsense knowledge for multi-intent recognition. The added knowledge provides awareness of implicit context related to the utterance's intention.
  • Figure 2: The overall architecture of our model is shown on the left. The lower right shows the flow of the Commonsense Knowledge Extractor (COKE), and the upper right details the Text Enhancement Module (TEM), which integrates relation features into textual representations using commonsense knowledge and a dual perspective learning module.
  • Figure 3: Model performance across different values of $\gamma$.