Hierarchical Aligned Multimodal Learning for NER on Tweet Posts

Peipei Liu; Hong Li; Yimo Ren; Jie Liu; Shuaizong Si; Hongsong Zhu; Limin Sun

TL;DR

This work tackles NER on multimodal tweets by proposing HamLearning, a hierarchical, end-to-end framework that aligns text and image across multiple semantic levels. The method combines intra-modality encoding (text via BERT and vision via global, object-level, semantic, and spatial representations), dynamic text-image relevance measuring, and iterative cross-modal learning with cross-modal Transformers to refine multimodal word representations for decoding. Empirical results on Twitter2015 and Twitter2017 show state-of-the-art performance, with extensive ablations and analyses confirming the contributions of multi-level visual features, relevance-guided fusion, and iterative cross-modal interaction. The approach demonstrates robust generalization and resilience to noisy visual content, offering a practical route for improved MNER in real-world social media data.

Abstract

Mining structured knowledge from tweets using named entity recognition (NER) can benefit many downstream applications such as recommendation and intention understanding. As tweet posts tend to be multimodal, multimodal named entity recognition (MNER) has attracted growing attention. In this paper, we propose a novel approach that dynamically aligns the image and text sequence and performs multi-level cross-modal learning to augment textual word representations for improved MNER. Specifically, our framework has three main stages: the first focuses on intra-modality representation learning to derive the implicit global and local knowledge of each modality; the second evaluates the relevance between the text and its accompanying image and integrates visual information of different granularities based on that relevance; the third enforces semantic refinement via iterative cross-modal interactions and co-attention. We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
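
The second stage, relevance-guided integration of visual information, can be illustrated with a small sketch. This is not the paper's formulation: the relevance score here is assumed to be a rescaled cosine similarity between a pooled text vector and a global image vector, and the fusion is a simple gated mean-pool added to each word vector, standing in for the cross-modal Transformers the method actually uses. All function names (`relevance`, `gate_visual`, `fuse`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relevance(text_vec, image_vec):
    """Assumed relevance score in [0, 1]: cosine similarity rescaled from [-1, 1]."""
    return 0.5 * (cosine(text_vec, image_vec) + 1.0)


def gate_visual(visual_feats, r):
    """Scale every visual feature vector by the relevance score r."""
    return [[r * x for x in feat] for feat in visual_feats]


def fuse(word_vecs, visual_feats, r):
    """Add the mean of the gated visual features to each word vector.

    A stand-in for cross-modal attention: with r = 0 (irrelevant image)
    the textual representations pass through unchanged.
    """
    gated = gate_visual(visual_feats, r)
    dim = len(word_vecs[0])
    pooled = [sum(f[d] for f in gated) / len(gated) for d in range(dim)]
    return [[w[d] + pooled[d] for d in range(dim)] for w in word_vecs]
```

The point of the gate is that an image judged irrelevant to the text contributes nothing to the word representations, which is one simple way to limit the influence of noisy visual content.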

Paper Structure (22 sections, 18 equations, 5 figures, 6 tables)

Introduction
Related Work
Method
Intra-modality Learning
Text Encoding
Vision Encoding
Relevance Measuring
Inter-modality Learning
MNER Decoding
Experiments
Datasets
Implementation Details
Baselines
Main Results
Ablation Study
...and 7 more sections

Figures (5)

Figure 1: The samples for MNER task, where the named entities and their types are highlighted. a: fully relevant (explicit support information), b: partially relevant (implicit support information), c: entity irrelevant (no entities, no support).
Figure 2: The overview of our proposed method.
Figure 3: Illustration of the relation between object $i$ (red region) and object $j$ (black region).
Figure 4: The changes of important indicators (i.e., Loss and F1) during the training process of our model.
Figure 5: The case comparisons of our model and others.
