Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction

Qian Li, Cheng Ji, Shu Guo, Yong Zhao, Qianren Mao, Shangguang Wang, Yuntao Wei, Jianxin Li

TL;DR

The paper addresses multi-modal relation extraction (MMRE) in contexts where a single sentence-image pair contains multiple entity pairs with potentially different relations. It introduces VM-HAN, a framework that builds a multi-modal hypergraph per sentence, connecting textual entities with the image and detected objects via global, intra-modal, and inter-modal hyperedges. Node and hyperedge representations are learned under a variational scheme: each is modeled as a Gaussian distribution and updated through variational hypergraph attention to capture high-order, cross-modal correlations. The model is trained with a joint objective combining relation classification, reconstruction, and KL regularization, and it achieves state-of-the-art results on MNRE and MORE while offering improved efficiency. The approach demonstrates that explicitly modeling hypergraph structures and distributional uncertainty can better disambiguate relations across multiple entity pairs in multimodal contexts, with practical implications for information extraction and cross-modal reasoning.

Abstract

Multi-modal relation extraction (MMRE) is a challenging task that aims to identify relations between entities in text by leveraging image information. Existing methods are limited because they neglect that multiple entity pairs in one sentence share very similar contextual information (i.e., the same text and image), which makes the MMRE task more difficult. To address this limitation, we propose the Variational Multi-Modal Hypergraph Attention Network (VM-HAN) for multi-modal relation extraction. Specifically, we first construct a multi-modal hypergraph for each sentence with the corresponding image, to establish different high-order intra-/inter-modal correlations for different entity pairs in each sentence. We further design the Variational Hypergraph Attention Networks (V-HAN) to obtain representational diversity among different entity pairs using Gaussian distributions and to learn a better hypergraph structure via variational attention. VM-HAN achieves state-of-the-art performance on the multi-modal relation extraction task, outperforming existing methods in terms of accuracy and efficiency.
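
To make the idea of a multi-modal hypergraph concrete, the following sketch builds a node-by-hyperedge incidence matrix for one sentence-image pair. The abstract does not specify how hyperedge membership is chosen, so the node ordering, the function name `build_incidence`, and the rule of one inter-modal hyperedge per textual entity are assumptions made purely for illustration.

```python
import numpy as np


def build_incidence(num_entities, num_objects):
    """Return a (nodes x hyperedges) 0/1 incidence matrix.

    Assumed node order: textual entities, then the whole image,
    then the detected objects.
    """
    n = num_entities + 1 + num_objects
    text = list(range(num_entities))
    visual = list(range(num_entities, n))  # image node plus object nodes

    hyperedges = [
        text + visual,  # global hyperedge over every node
        text,           # intra-modal hyperedge over textual entities
        visual,         # intra-modal hyperedge over visual nodes
    ]
    # Assumed inter-modal rule: link each entity to all visual nodes.
    for e in text:
        hyperedges.append([e] + visual)

    H = np.zeros((n, len(hyperedges)), dtype=np.int8)
    for j, members in enumerate(hyperedges):
        H[members, j] = 1
    return H
```

Unlike an ordinary graph edge, each column of `H` can connect any number of nodes, which is what lets a single hyperedge express a high-order correlation among an entity, the image, and several objects at once.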

Paper Structure (31 sections, 13 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: An example of the MMRE task. The task is to predict the relation of given entity pairs for a specific text and image, where the image contains multiple objects.
  • Figure 2: VM-HAN models text and corresponding images as a hypergraph to capture high-order correlations, and further learns entity pair representations under Gaussian distributions for robust node and hyperedge learning.
  • Figure 3: Different proportions of visual information on MNRE. "Without Image" means deleting all images. "Using 50% Image" means randomly deleting 50% of the images.
  • Figure 4: Impact of differences in sample number on MNRE. It shows the performance (F1) when an entity belongs to one or multiple entity types.
  • Figure 5: Impact of relation numbers for each sentence.
  • ...and 5 more figures