Few-Shot Relation Extraction with Hybrid Visual Evidence
Jiaying Gong, Hoda Eldardiry
TL;DR
In the $N$-way-$K$-shot setting, limited labeled data and sparse textual context hinder relation prediction. The authors propose MFS-HVE, a multi-modal framework that jointly learns textual and visual representations and fuses them with image-guided, object-guided, and hybrid feature attention, using a cross-modality encoder and a prototypical classifier with hyperbolic distance $d(\,\cdot\,)$. The prototypes $P_{multi}(S)$ are computed as the mean of the multi-modal support embeddings $L_{multi}$, and queries are classified by their hyperbolic distance to these prototypes. The study reports substantial gains over text-only and other fusion baselines on MNRE and FewRelsmall, with ablations indicating that each attention stream and the hybrid fusion contribute to performance. The results indicate that semantic visual information can compensate for missing textual cues in low-resource relation extraction, offering a practical path toward robust multi-modal few-shot learning in real-world settings.
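The prototype step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the summary does not say which hyperbolic distance is used, so the sketch assumes the Poincaré-ball distance, and it assumes every embedding already lies strictly inside the unit ball. The function names `mean_prototype`, `poincare_distance`, and `classify` are hypothetical.

```python
import math

def mean_prototype(support):
    """Component-wise mean of a list of equal-length embedding vectors."""
    k = len(support)
    return [sum(vec[i] for vec in support) / k for i in range(len(support[0]))]

def poincare_distance(u, v, eps=1e-9):
    """Poincare-ball distance (an assumed choice of hyperbolic distance).

    Both points are assumed to have Euclidean norm strictly below 1.
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Guard against division by zero for points very close to the boundary.
    denom = max((1.0 - sq_u) * (1.0 - sq_v), eps)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)

def classify(query, support_sets):
    """Return the label whose mean prototype is nearest to the query."""
    prototypes = {label: mean_prototype(vecs) for label, vecs in support_sets.items()}
    return min(prototypes, key=lambda label: poincare_distance(query, prototypes[label]))

# Toy 2-way-2-shot episode with made-up 2-D embeddings and labels.
support_sets = {
    "works_for": [[0.1, 0.2], [0.2, 0.1]],
    "born_in": [[-0.3, -0.2], [-0.2, -0.3]],
}
predicted = classify([0.15, 0.1], support_sets)
```

Because the unit ball is convex, the Euclidean mean of points inside it stays inside it, so the prototypes remain valid inputs to `poincare_distance` under the stated assumption.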
Abstract
The goal of few-shot relation extraction is to predict relations between named entities in a sentence when only a few labeled instances are available for training. Existing few-shot relation extraction methods focus on uni-modal information such as text only, which reduces performance when the text provides no clear context between the named entities. We propose a multi-modal few-shot relation extraction model (MFS-HVE) that leverages both textual and visual semantic information to jointly learn a multi-modal representation. MFS-HVE consists of semantic feature extractors and a multi-modal fusion unit. The semantic feature extractors extract both textual and visual features, where the visual features include global image features and local object features within the image. The multi-modal fusion unit integrates information from the different modalities using image-guided attention, object-guided attention, and hybrid feature attention to fully capture the semantic interaction between visual regions of images and relevant texts. Extensive experiments conducted on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.
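To make the idea of guided attention concrete, the sketch below shows generic scaled dot-product attention pooling, in which a guiding vector weights a set of feature vectors. The abstract does not give the attention formulas, so this is an illustrative assumption and not the MFS-HVE fusion unit itself. The name `guided_attention`, and the choice of a visual feature as the guide with text token features as the attended set, are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def guided_attention(guide, features):
    """Pool `features` using weights from scaled dot products with `guide`.

    `guide` might be a global image feature or a single object feature, and
    `features` a list of text token vectors (an assumed arrangement).
    Returns the pooled vector and the attention weights.
    """
    scale = math.sqrt(len(guide))
    scores = [sum(g * f for g, f in zip(guide, feat)) / scale for feat in features]
    weights = softmax(scores)
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(len(guide))
    ]
    return pooled, weights

# Toy example: the first token aligns with the guide, so it gets more weight.
pooled, weights = guided_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Under this reading, image-guided and object-guided attention would differ only in which visual vector plays the role of `guide`; how the hybrid feature attention combines the two streams is not specified in the abstract and is left out of the sketch.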
