Multimodal Relational Triple Extraction with Query-based Entity Object Transformer
Lei Hei, Ning An, Tingjing Liao, Qi Ma, Jiaqi Wang, Feiliang Ren
TL;DR
This work defines MORTE, a novel end-to-end task that extracts all multimodal triples (entity span, relation, object region) from image-text pairs, removing the requirement of pipeline approaches that entities and objects be given in advance. It introduces QEOT, a query-based, DETR-like Transformer with selective attention and gated fusion that jointly performs entity extraction, relation classification, and object detection, and is trained with a set-based loss that uses Hungarian matching to align predictions with ground truth. The MORTE dataset, derived from MORE, contains 3,559 images, 1,681 entities, and 20,264 triples across 21 relations, enabling evaluation of cross-modal triple extraction. QEOT achieves state-of-the-art performance on MORTE, outperforming existing baselines by 8.06%, which supports end-to-end multimodal fusion and multi-task learning as a route to knowledge-base construction from multimodal content.
Abstract
Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs. Recent studies focus on classifying the relation between entity pairs that span modalities, such as one entity in the text and another in the image. However, existing approaches require the entities and objects to be given beforehand, which is costly and impractical. To address this limitation, we propose a novel task, Multimodal Entity-Object Relational Triple Extraction, which aims to extract all triples (entity span, relation, object region) from image-text pairs. To facilitate this study, we modified MORE, a multimodal relation extraction dataset with 21 relation types, to create a new dataset containing 20,264 triples, averaging 5.75 triples per image-text pair. Moreover, we propose QEOT, a query-based model with a selective attention mechanism that dynamically explores the interaction and fusion of textual and visual information. In particular, the proposed method simultaneously performs entity extraction, relation classification, and object detection with a set of queries. Our method is suitable for downstream applications and reduces the error accumulation inherent in pipeline-style approaches. Extensive experimental results demonstrate that our proposed method outperforms existing baselines by 8.06% and achieves state-of-the-art performance.
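
To make the set-based matching mentioned in the TL;DR concrete, the following is a minimal sketch of how a set of query predictions might be aligned one-to-one with ground-truth triples before a loss is computed. It is not the authors' implementation. The cost terms (exact-match penalties for the entity span and relation, plus 1 - IoU for the object box), the function names `box_iou`, `triple_cost`, and `match_triples`, and the relation labels in the example are all assumptions made for illustration. For brevity the sketch finds the minimum-cost assignment by brute force over permutations; it yields the same optimal assignment the Hungarian algorithm would, but only scales to a handful of triples.

```python
from itertools import permutations

# A triple is (entity_span, relation, box), where entity_span = (start, end)
# token offsets and box = (x1, y1, x2, y2). This layout is an assumption.


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def triple_cost(pred, gold):
    """Illustrative matching cost: lower means a better match."""
    span_cost = 0.0 if pred[0] == gold[0] else 1.0
    rel_cost = 0.0 if pred[1] == gold[1] else 1.0
    region_cost = 1.0 - box_iou(pred[2], gold[2])
    return span_cost + rel_cost + region_cost


def match_triples(preds, golds):
    """Minimum-cost one-to-one assignment of predictions to gold triples.

    Returns (assignment, total_cost), where assignment[i] is the index of
    the prediction matched to golds[i]. Unmatched predictions would be
    treated as "no triple" by a set-based loss.
    """
    if len(preds) < len(golds):
        raise ValueError("need at least as many predictions as gold triples")
    best, best_cost = (), float("inf")
    # Brute force stands in for the Hungarian algorithm here.
    for perm in permutations(range(len(preds)), len(golds)):
        cost = sum(triple_cost(preds[p], golds[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost


# Hypothetical example: three query outputs, two gold triples.
golds = [
    ((0, 2), "held", (10, 10, 50, 50)),
    ((3, 4), "wear", (60, 60, 100, 100)),
]
preds = [
    ((3, 4), "wear", (60, 60, 100, 100)),  # exact match for golds[1]
    ((7, 8), "none", (0, 0, 5, 5)),        # spurious query output
    ((0, 2), "held", (10, 10, 50, 30)),    # right span/relation, box IoU 0.5
]
assignment, total = match_triples(preds, golds)
print(assignment, total)  # [2, 0] 0.5
```

In this example the spurious second prediction is left unmatched, which is the behaviour a set-based loss relies on: the number of queries can exceed the number of triples in an image-text pair, and the surplus queries are trained to predict nothing.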
