Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition

Jielong Tang, Zhenxing Wang, Ziyang Gong, Jianxing Yu, Xiangwei Zhu, Jian Yin

TL;DR

This paper tackles Grounded Multimodal Named Entity Recognition by addressing both intra-entity and inter-entity relationships. It introduces MQSPN, a unified framework that uses a learnable Multi-grained Query Set (MQS) to capture fine-grained intra-entity semantics and a Multimodal Set Prediction Network (MSP) to predict a set of span-type-region quadruples via global optimal matching, facilitated by a Query-guided Fusion Net (QFNet). The model achieves state-of-the-art results on two Twitter GMNER benchmarks, with ablations showing the importance of learnable queries, fusion mechanisms, and bipartite matching loss. The approach reduces reliance on autoregressive decoding and enhances robustness to irrelevant visual regions, offering a scalable and efficient solution for joint extraction and grounding in multimodal content.

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task that aims to simultaneously extract entity spans, types, and the corresponding visual regions of entities from given sentence-image pairs. Recent unified methods based on machine reading comprehension or sequence generation frameworks show limitations on this difficult task. The former, which rely on human-designed type queries, struggle to differentiate ambiguous entities, such as Jordan (Person) and off-White x Jordan (Shoes). The latter, which decode entities one by one, suffer from exposure bias. We maintain that these works misunderstand the relationships of multimodal entities. To tackle these issues, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at the intra-entity and inter-entity levels. Specifically, MQSPN explicitly aligns textual entities with visual regions by employing a set of learnable queries to strengthen intra-entity connections. Building on this distinct intra-entity modeling, MQSPN reformulates GMNER as a set prediction problem, guiding the model to establish appropriate inter-entity relationships from an optimal global matching perspective. Additionally, we incorporate a Query-guided Fusion Net (QFNet) as a glue network to improve the alignment of the two-level relationships. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on widely used benchmarks.
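
To make the "optimal global matching" idea in the abstract more concrete, here is a minimal sketch of matching a set of predictions to a set of gold entities so that the total cost is minimized. It is not the authors' implementation. The tuple layout `(start, end, type, region_id)`, the 0/1 field-mismatch cost, and the function names `match_cost` and `optimal_matching` are assumptions made for illustration, and exhaustive search stands in for a proper assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations


def match_cost(pred, gold):
    """Hypothetical cost: number of fields on which the two tuples disagree."""
    return sum(p != g for p, g in zip(pred, gold))


def optimal_matching(preds, golds):
    """Brute-force the lowest-cost one-to-one assignment of predictions to golds.

    Returns (total_cost, assignment), where assignment[i] is the index of the
    prediction matched to gold entity i. Assumes len(preds) >= len(golds);
    leftover predictions are simply unmatched in this sketch.
    """
    if len(preds) < len(golds):
        raise ValueError("need at least as many predictions as gold entities")
    best_cost, best_assignment = None, None
    # Try every ordered choice of len(golds) distinct predictions.
    for perm in permutations(range(len(preds)), len(golds)):
        cost = sum(match_cost(preds[p], golds[g]) for g, p in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_cost, best_assignment


# Toy data (invented): tuples are (start, end, type, region_id).
golds = [(0, 0, "PER", 1), (3, 5, "MISC", 2)]
preds = [(3, 5, "MISC", 2), (7, 7, "LOC", 0), (0, 0, "PER", 3)]

cost, assignment = optimal_matching(preds, golds)
print(cost, assignment)
```

Because the assignment is chosen jointly over the whole set, the order in which predictions are produced does not affect the result, which is the property that distinguishes set prediction from one-by-one decoding. Exhaustive search is only practical for very small sets; a polynomial-time assignment solver would be the usual choice at realistic sizes.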

Paper Structure

This paper contains 24 sections, 19 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison of existing approaches and our MQSPN. (a) MRC-based methods are trapped by the entity ambiguity issue due to intra-entity misunderstanding. (b) Sequence generation-based methods suffer from the exposure bias issue due to inter-entity overreliance. (c) Our MQSPN models appropriate two-level relationships with a learnable query set and set prediction.
  • Figure 2: (a) Overview of our MQSPN. (b) Construction of the Multi-grained Query Set (MQS), which consists of Type-grained Queries and Learnable Entity-grained Queries (LEQ). (c) The detailed architecture of the Query-guided Fusion Net (QFNet).
  • Figure 3: Analysis of the number of candidate visual regions $k$ for H-Index, TIGER, and our MQSPN.
  • Figure 4: Analysis of the number of multi-grained queries $u$ on the GMNER, MNER, and EEG tasks.
  • Figure 5: Predictions of MQSPN in different ablation settings.
  • ...and 2 more figures