Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

Qianrui Zhou, Hua Xu, Hao Li, Hanlei Zhang, Xiaohan Zhang, Yifan Wang, Kai Gao

TL;DR

This work tackles multimodal intent recognition by jointly exploiting text, video, and audio cues. It introduces a modality-aware prompting (MAP) module to align and fuse multimodal features and a token-level contrastive learning (TCL) framework that uses ground-truth label tokens within augmented text prompts to supervise nonverbal modalities, optimized via the NT-Xent loss. The approach achieves state-of-the-art results on MIntRec and MELD-DA, with extensive ablations confirming the superiority of modality-aware prompts over handcrafted prompts and the effectiveness of token-level contrastive learning. The proposed TCL-MAP design advances multimodal prompt learning and offers practical improvements for robust real-world intent understanding across diverse modalities.

Abstract

Multimodal intent recognition aims to leverage diverse modalities such as expressions, body movements and tone of speech to comprehend the user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and are limited in their ability to learn semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address these challenges. To establish an optimal multimodal semantic environment for the text modality, we develop a modality-aware prompting module (MAP), which aligns and fuses features from the text, video and audio modalities using similarity-based modality alignment and a cross-modality attention mechanism. Based on the modality-aware prompt and ground-truth labels, the proposed token-level contrastive learning framework (TCL) constructs augmented samples and applies the NT-Xent loss to the label token. Specifically, TCL uses the textual semantic information carried by the intent labels to guide, in turn, the learning of the other modalities. Extensive experiments show that our method achieves remarkable improvements compared to state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over the handcrafted prompt, which holds substantial significance for multimodal prompt learning. The code is released at https://github.com/thuiar/TCL-MAP.
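
To make the contrastive objective concrete, the following is a minimal sketch of an NT-Xent loss over paired token embeddings, written in plain Python. It is not taken from the released code: the function names `cosine` and `nt_xent`, the default temperature of 0.5, and the use of cosine similarity are assumptions made for illustration. Each [MASK]-token embedding is treated as the positive of the label-token embedding from the same sample, and all other embeddings in the batch act as negatives.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(mask_tokens, label_tokens, temperature=0.5):
    # mask_tokens[i] and label_tokens[i] come from the same sample (a positive pair).
    # temperature is a hypothetical default, not a value stated in the source.
    n = len(mask_tokens)
    z = list(mask_tokens) + list(label_tokens)  # 2n embeddings in total
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of anchor i
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n)]
        # The denominator runs over every embedding except the anchor itself.
        denom = sum(math.exp(logits[k]) for k in range(2 * n) if k != i)
        total += -(logits[pos] - math.log(denom))
    return total / (2 * n)

# Usage: matching pairs should give a lower loss than mismatched pairs.
mask = [[1.0, 0.0], [0.0, 1.0]]
aligned = nt_xent(mask, [[1.0, 0.0], [0.0, 1.0]])
swapped = nt_xent(mask, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss for matching pairs is smaller than for swapped pairs, which is the behavior the objective is meant to reward: the [MASK] representation is pulled toward the label representation of its own sample.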

Paper Structure (30 sections, 15 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overview of the TCL-MAP architecture. In the Prompt-Based Augmentation module, we first create the modality-aware prompt from multimodal features, and then concatenate the text tokens, the prompt tokens and the [MASK]/Label token to construct an augmented pair. In the Representation Learning module, we extract the refined tokens for classification and conduct contrastive learning between the [MASK] token and the Label token.
  • Figure 2: Details of the Modality-Aware Prompting (MAP) module. We align multimodal features by content, using a computed similarity matrix as weights, and enhance correlations between modalities through a cross-modality transformer to create the modality-aware prompt.
  • Figure 3: Comparison between the handcrafted prompt and the modality-aware prompt.
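
The similarity-weighted alignment described in the Figure 2 caption can be sketched as follows. This is an illustrative reading only: the dot-product similarity, the softmax normalization, and the function names `softmax` and `align_to_text` are assumptions, and the cross-modality transformer that follows alignment is not shown. Each text token receives a weighted average of the video (or audio) features, so the aligned sequence has the same length as the text sequence.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def align_to_text(text_feats, other_feats):
    # text_feats: list of text-token vectors; other_feats: list of video or
    # audio feature vectors of the same dimension (a simplifying assumption).
    dim = len(other_feats[0])
    aligned = []
    for t in text_feats:
        # One row of the similarity matrix: text token vs. every other-modality step.
        scores = [sum(a * b for a, b in zip(t, o)) for o in other_feats]
        weights = softmax(scores)
        # Weighted average of the other modality's features for this text token.
        aligned.append(
            [sum(w * o[d] for w, o in zip(weights, other_feats)) for d in range(dim)]
        )
    return aligned

# Usage: 2 text tokens, 3 video frames -> 2 aligned video vectors.
text = [[1.0, 0.0], [0.0, 1.0]]
video = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
aligned_video = align_to_text(text, video)
```

The first text token weights the first frame most heavily and the second token weights the second frame most heavily, so nonverbal features are reordered by content rather than by position.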