A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition

Yaomin Shen, Xiaojian Lin, Wei Fan

TL;DR

The paper tackles multimodal intent recognition by aligning heterogeneous signals (text, audio, video) with semantic descriptions of intents. It introduces A-MESS, combining an Anchor-based Multimodal Embedding (A-ME) with a Semantic Synchronization (SS) framework that uses Triplet Contrastive Learning and LLM-generated label descriptions to train richer, semantically grounded representations. Experiments on the MintRec and MintRec2.0 datasets demonstrate state-of-the-art performance and emphasize the value of semantic space alignment for robust MIR, including out-of-scope detection. The work advances multimodal representation learning by explicitly leveraging anchor-based fusion and LLM-informed semantics to improve both in-scope accuracy and generalization to unseen intents.

Abstract

In the domain of multimodal intent recognition (MIR), the objective is to recognize human intent by integrating a variety of modalities, such as language text, body gestures, and tones. However, existing approaches struggle to adequately capture the intrinsic connections between the modalities and overlook the corresponding semantic representations of intent. To address these limitations, we present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. Furthermore, we develop a Semantic Synchronization (SS) strategy with a Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing the multimodal representation with label descriptions produced by a large language model. Comprehensive experiments indicate that A-MESS achieves state-of-the-art performance and provides substantial insight into multimodal representation and downstream tasks.
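
The triplet objective behind Semantic Synchronization can be illustrated with a minimal, dependency-free sketch. It assumes a standard margin-based triplet loss over cosine distance, where the anchor is a multimodal embedding, the positive is the embedding of a description of the true intent label, and the negative is the embedding of a description of a different label. The function names, the margin value, and the choice of cosine distance are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss (hypothetical helper, not the paper's exact loss).

    anchor:   multimodal embedding of one sample
    positive: embedding of a description of the sample's true label
    negative: embedding of a description of some other label
    The loss is zero once the positive is closer than the negative by `margin`.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    multimodal = [1.0, 0.0]
    true_label_description = [1.0, 0.0]
    other_label_description = [0.0, 1.0]
    print(triplet_loss(multimodal, true_label_description, other_label_description))
```

In this toy case the multimodal embedding already coincides with the true label description, so the loss vanishes; swapping the positive and negative produces a positive loss that would push the representation back toward the correct description.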

Paper Structure

This paper contains 21 sections, 17 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The architecture of the A-MESS framework. Illustration of the framework employed in the majority of preceding studies (left) and ours (right), which performs synchronization with the descriptions generated by the Large Language Model.
  • Figure 2: Overview of the A-MESS architecture. In the Multimodal Auxiliary Text Representation component (top), we first feed the feature embeddings of audio, images, and text into the A-ME module to enhance the text embedding with the auxiliary modalities. The generated embeddings are then concatenated and fed into the multimodal encoder to achieve multimodal integration. In the Semantic Synchronization phase (bottom), we encode multiple description sentences generated by the LLM into embeddings, then apply triplet contrastive learning between them and the previously obtained multimodal embeddings. Finally, these embeddings are fed back into the Multimodal Auxiliary Text Representation component for classification.
  • Figure 3: Architecture of the A-ME module. First, sequence lengths and dimensions are aligned, and anchors are selected with a top-k ratio. The anchors are then fed into the anchor cross-attention and feedforward (FF) layers. Finally, the anchor-enhanced embeddings are fed into the temporal cross-attention to obtain the final results.
  • Figure 4: Analysis of performance against the number of anchors on MintRec [zhang2022mintrec] (left) and MintRec 2.0 [zhang2024mintrec] (right), where blue lines represent ACC and red lines represent F1 score. When the number of anchors is 50, no anchor selection takes place and all multimodal tokens are used.
  • Figure 5: Semantic analysis, with $\mathbf{T}_f^{mean}$ in blue and $\mathbf{T}_f^{SS}$ in green. The dotted line indicates the semantic plane of the label.
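
The top-k anchor selection step mentioned in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: the captions do not say how tokens are scored, so the sketch assumes each token is scored by its L2 norm, and the function name `select_anchors` and the `ratio` parameter are hypothetical. A ratio of 1.0 keeps every token, which mirrors the case in Figure 4 where no anchor selection takes place.

```python
import math


def select_anchors(tokens, ratio):
    """Pick the top-k tokens by L2 norm and return their indices in temporal order.

    tokens: list of equal-length feature vectors, one per time step
    ratio:  fraction of tokens to keep as anchors, in (0, 1]
    The L2-norm score is an assumption; any per-token score could be substituted.
    """
    # Number of anchors, rounded to the nearest integer and never below one.
    k = max(1, round(ratio * len(tokens)))
    scores = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    # Rank token indices by descending score, keep the best k.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so later attention layers see anchors in sequence.
    return sorted(ranked[:k])


if __name__ == "__main__":
    features = [[0.1, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
    print(select_anchors(features, ratio=0.5))
```

Returning indices rather than the vectors themselves keeps the sketch neutral about what happens next; in the module described above, the selected anchors would go through the anchor cross-attention and feedforward layers.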