Hierarchical Aligned Multimodal Learning for NER on Tweet Posts

Peipei Liu, Hong Li, Yimo Ren, Jie Liu, Shuaizong Si, Hongsong Zhu, Limin Sun

TL;DR

This work tackles NER on multimodal tweets by proposing HamLearning, a hierarchical, end-to-end framework that aligns text and image across multiple semantic levels. The method combines intra-modality encoding (text via BERT and vision via global, object-level, semantic, and spatial representations), dynamic text-image relevance measuring, and iterative cross-modal learning with cross-modal Transformers to refine multimodal word representations for decoding. Empirical results on Twitter2015 and Twitter2017 show state-of-the-art performance, with extensive ablations and analyses confirming the contributions of multi-level visual features, relevance-guided fusion, and iterative cross-modal interaction. The approach demonstrates robust generalization and resilience to noisy visual content, offering a practical route for improved MNER in real-world social media data.

Abstract

Mining structured knowledge from tweets using named entity recognition (NER) can benefit many downstream applications, such as recommendation and intention understanding. As tweet posts tend to be multimodal, multimodal named entity recognition (MNER) has attracted growing attention. In this paper, we propose a novel approach that dynamically aligns the image and the text sequence and performs multi-level cross-modal learning to augment textual word representations for MNER. Specifically, our framework has three main stages: the first focuses on intra-modality representation learning to derive the implicit global and local knowledge of each modality; the second evaluates the relevance between the text and its accompanying image and integrates visual information at different granularities based on that relevance; the third enforces semantic refinement via iterative cross-modal interactions and co-attention. We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
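
The second and third stages can be illustrated with a toy sketch. This is not the authors' implementation: the cosine-based relevance score, the scaling of visual features by that score, the single-head attention, and every function name below are assumptions made only to show the idea of relevance-guided fusion followed by iterative cross-modal refinement. Word and region vectors are plain Python lists standing in for the encoder outputs of the first stage.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def mean_pool(vectors):
    # Average a list of equal-length vectors into one sentence-level vector.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevance(words, global_image):
    # Assumed form of the text-image relevance: cosine similarity between the
    # pooled text vector and a global image vector, rescaled from [-1, 1] to [0, 1].
    return 0.5 * (cosine(mean_pool(words), global_image) + 1.0)

def cross_attend(words, regions):
    # One round of scaled dot-product attention: each word queries the regions.
    scale = math.sqrt(len(words[0]))
    out = []
    for w in words:
        weights = softmax([dot(w, r) / scale for r in regions])
        out.append([sum(a * r[k] for a, r in zip(weights, regions))
                    for k in range(len(w))])
    return out

def fuse(words, global_image, regions, rounds=2):
    # Scale visual features by the relevance score (hypothetical gating),
    # then refine word vectors over several cross-modal rounds with a residual add.
    gate = relevance(words, global_image)
    gated = [[gate * x for x in r] for r in regions]
    for _ in range(rounds):
        context = cross_attend(words, gated)
        words = [[wi + ci for wi, ci in zip(w, c)] for w, c in zip(words, context)]
    return words, gate
```

Calling `fuse(words, global_image, regions)` returns the refined word vectors and the relevance score. Under these assumptions, an image judged irrelevant drives the score toward zero, so the visual context added to each word also shrinks toward zero and the textual representation is left essentially unchanged.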

Paper Structure (22 sections, 18 equations, 5 figures, 6 tables)


Figures (5)

  • Figure 1: Samples for the MNER task, with the named entities and their types highlighted. a: fully relevant (explicit support information), b: partially relevant (implicit support information), c: entity irrelevant (no entities, no support).
  • Figure 2: The overview of our proposed method.
  • Figure 3: The relation between object $i$ (red region) and object $j$ (black region).
  • Figure 4: Changes in key indicators (loss and F1) during the training of our model.
  • Figure 5: Case comparisons between our model and others.