Multimodal Relational Triple Extraction with Query-based Entity Object Transformer

Lei Hei, Ning An, Tingjing Liao, Qi Ma, Jiaqi Wang, Feiliang Ren

TL;DR

This work defines MORTE, a novel end-to-end task to extract all possible multimodal triples (entity span, relation, object region) from image-text pairs, addressing limitations of pipeline approaches that require pre-identified entities. It introduces QEOT, a query-based, DETR-like Transformer with selective attention and gated fusion to jointly perform entity extraction, relation classification, and object detection, optimized via a set-based Hungarian matching loss that aligns predictions to ground truth. The MORTE dataset, derived from MORE, contains 3,559 images, 1,681 entities, and 20,264 triples across 21 relations, enabling comprehensive evaluation of cross-modal triple extraction. Empirical results show QEOT achieving state-of-the-art performance on MORTE, significantly surpassing baselines and demonstrating the effectiveness of end-to-end multimodal fusion and multi-task learning for robust knowledge-base construction from multimodal content.
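The set-based matching mentioned above can be illustrated with a small sketch. It is not the paper's implementation: the cost terms (exact span match, exact relation match, 1 − IoU on boxes) and their equal weighting are assumptions, the relation labels are hypothetical, and a brute-force search over assignments stands in for the Hungarian algorithm, which finds the same minimum-cost assignment more efficiently.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Tuple


@dataclass(frozen=True)
class Triple:
    span: Tuple[int, int]                    # (start, end) token indices of the entity
    relation: str                            # relation label (hypothetical names below)
    box: Tuple[float, float, float, float]   # object region as (x1, y1, x2, y2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pair_cost(pred: Triple, gold: Triple) -> float:
    """Assumed, unweighted cost: span mismatch + relation mismatch + (1 - IoU)."""
    span_cost = 0.0 if pred.span == gold.span else 1.0
    rel_cost = 0.0 if pred.relation == gold.relation else 1.0
    return span_cost + rel_cost + (1.0 - iou(pred.box, gold.box))


def match(preds: List[Triple], golds: List[Triple]):
    """Return (total_cost, assignment); assignment[g] is the prediction index for gold g.

    Brute force over all one-to-one assignments, so only suitable for tiny sets.
    """
    assert len(preds) >= len(golds), "need at least one query per gold triple"
    best_cost, best_assignment = float("inf"), ()
    for perm in permutations(range(len(preds)), len(golds)):
        cost = sum(pair_cost(preds[p], golds[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_cost, best_assignment


golds = [
    Triple((0, 1), "held_by", (0.0, 0.0, 10.0, 10.0)),
    Triple((3, 4), "near", (20.0, 20.0, 30.0, 30.0)),
]
preds = [
    Triple((3, 4), "near", (20.0, 20.0, 30.0, 30.0)),
    Triple((5, 6), "none", (50.0, 50.0, 60.0, 60.0)),
    Triple((0, 1), "held_by", (0.0, 0.0, 10.0, 10.0)),
]
total_cost, assignment = match(preds, golds)
```

In a DETR-style setup, the predictions left unmatched (here, index 1) would typically be trained toward a "no triple" class, and the loss would be computed only over the matched pairs.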

Abstract

Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs. Recent studies focus on extracting the relation type for entity pairs present in different modalities, such as one entity in the text and another in the image. However, existing approaches require the entities and objects to be given beforehand, which is costly and impractical. To address this limitation, we propose a novel task, Multimodal Entity-Object Relational Triple Extraction, which aims to extract all triples (entity span, relation, object region) from image-text pairs. To facilitate this study, we modified the multimodal relation extraction dataset MORE, which includes 21 relation types, to create a new dataset containing 20,264 triples, averaging 5.75 triples per image-text pair. Moreover, we propose QEOT, a query-based model with a selective attention mechanism, to dynamically explore the interaction and fusion of textual and visual information. In particular, the proposed method can simultaneously accomplish entity extraction, relation classification, and object detection with a set of queries. Our method is suitable for downstream applications and reduces the error accumulation of pipeline-style approaches. Extensive experimental results demonstrate that our proposed method outperforms the existing baselines by 8.06% and achieves state-of-the-art performance.

Paper Structure (24 sections, 9 equations, 7 figures, 4 tables, 1 algorithm)

Figures (7)

  • Figure 1: An example of multimodal relation extraction for entity pairs present in different modalities. Given the text and corresponding image, the MORE task only predicts the relation type between the existing entity and object. In contrast, MORTE must recognize entities, objects, and relations, plus extract all potential triples.
  • Figure 2: The overall query-based entity-object transformer architecture.
  • Figure 3: Visualization of selective attention of image to text.
  • Figure 4: Visualization of cross-attention of queries to image.
  • Figure 5: Detailed encoder-decoder architecture of the query-based transformer.
  • ...and 2 more figures