Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction
Qian Li, Cheng Ji, Shu Guo, Yong Zhao, Qianren Mao, Shangguang Wang, Yuntao Wei, Jianxin Li
TL;DR
The paper addresses multi-modal relation extraction (MMRE) in settings where a single sentence-image pair contains multiple entity pairs with potentially different relations. It introduces VM-HAN, a framework that builds a multi-modal hypergraph per sentence, connecting textual entities with the image and detected objects via global, intra-modal, and inter-modal hyperedges. Node and hyperedge representations are learned under a variational scheme: each is modeled as a Gaussian distribution and updated through variational hypergraph attention to capture high-order, cross-modal correlations. The model is trained with a joint objective combining relation classification, reconstruction, and KL regularization. It achieves state-of-the-art results on MNRE and MORE while offering improved efficiency. The approach demonstrates that explicitly modeling hypergraph structure and distributional uncertainty can better disambiguate relations across multiple entity pairs that share the same text and image, with practical implications for information extraction and cross-modal reasoning.
Abstract
Multi-modal relation extraction (MMRE) is a challenging task that aims to identify relations between entities in text by leveraging image information. Existing methods overlook the fact that multiple entity pairs in one sentence share very similar contextual information (i.e., the same text and image), which makes the MMRE task more difficult. To address this limitation, we propose the Variational Multi-Modal Hypergraph Attention Network (VM-HAN) for multi-modal relation extraction. Specifically, we first construct a multi-modal hypergraph for each sentence and its corresponding image, to establish different high-order intra-/inter-modal correlations for different entity pairs in each sentence. We further design the Variational Hypergraph Attention Networks (V-HAN) to obtain representational diversity among different entity pairs using Gaussian distributions and to learn a better hypergraph structure via variational attention. VM-HAN achieves state-of-the-art performance on the multi-modal relation extraction task, outperforming existing methods in terms of accuracy and efficiency.
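To make the variational ingredients concrete, the following is a minimal sketch, not the authors' implementation. It assumes diagonal Gaussians parameterized by a mean and a log-variance, a standard normal prior for the KL term, and hypothetical weights `alpha` and `beta` for combining the three loss terms; none of these choices is stated in the abstract.

```python
import math
import random


def sample_gaussian(mu, log_var, rng):
    """Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    The standard normal prior is an assumption made for this sketch.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def joint_loss(cls_loss, recon_loss, kl_loss, alpha=1.0, beta=1.0):
    """Weighted sum of classification, reconstruction, and KL terms.

    alpha and beta are hypothetical weights introduced for illustration.
    """
    return cls_loss + alpha * recon_loss + beta * kl_loss


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -0.2], [0.0, -1.0]  # toy node representation
    z = sample_gaussian(mu, log_var, rng)
    kl = kl_to_standard_normal(mu, log_var)
    print(len(z), kl >= 0.0)
```

Representing each node or hyperedge as a distribution rather than a point lets entity pairs that share the same text and image still receive distinct sampled representations, while the KL term keeps those distributions from drifting arbitrarily.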
