Few-Shot Relation Extraction with Hybrid Visual Evidence

Jiaying Gong, Hoda Eldardiry

TL;DR

In the $N$-way-$K$-shot setting, limited labeled data and sparse textual context hinder relation prediction. The authors propose MFS-HVE, a multi-modal framework that jointly learns textual and visual representations and fuses them through image-guided, object-guided, and hybrid feature attention, using a cross-modality encoder and a prototypical classifier with a hyperbolic distance $d(\,\cdot\,)$. The prototypes $P_{multi}(S)$ are computed as the mean of the multi-modal support embeddings $L_{multi}$, and each query is assigned to the relation whose prototype is nearest under the hyperbolic distance. The study reports substantial gains over text-only and other fusion baselines on MNRE and FewRelsmall, with ablations indicating that each attention stream and the hybrid fusion contribute to performance. The results indicate that semantic visual information can compensate for missing textual cues in low-resource relation extraction, offering a practical path toward robust multi-modal few-shot learning in real-world settings.
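The prototype-and-distance step described above can be illustrated with a minimal sketch. It assumes the Poincaré-ball form of the hyperbolic distance, which is one common choice; the summary only says "hyperbolic distance", so the exact formula used in MFS-HVE may differ. The function names, relation labels, and toy 2-D embeddings are hypothetical and stand in for the fused multi-modal embeddings.

```python
import math

def poincare_distance(u, v, eps=1e-9):
    """Poincare-ball distance between two points with norm < 1.

    Assumed form of the hyperbolic distance d(.); not confirmed by the source.
    """
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    # Guard against points on or outside the unit-ball boundary.
    denom = max((1.0 - sq_norm_u) * (1.0 - sq_norm_v), eps)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)

def compute_prototypes(support):
    """Average the K support embeddings of each relation, coordinate by coordinate."""
    protos = {}
    for relation, vectors in support.items():
        k = len(vectors)
        protos[relation] = [sum(column) / k for column in zip(*vectors)]
    return protos

def classify(query, protos):
    """Return the relation whose prototype is closest to the query embedding."""
    return min(protos, key=lambda rel: poincare_distance(query, protos[rel]))

# Toy 2-way-2-shot episode with made-up embeddings inside the unit ball.
support = {
    "relation_a": [[0.10, 0.20], [0.20, 0.10]],
    "relation_b": [[-0.30, -0.20], [-0.20, -0.30]],
}
protos = compute_prototypes(support)
print(classify([0.12, 0.18], protos))  # relation_a
```

Averaging in Euclidean coordinates keeps each prototype inside the unit ball because the ball is convex, which is why the sketch can feed the mean directly into the distance function.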

Abstract

The goal of few-shot relation extraction is to predict relations between named entities in a sentence when only a few labeled instances are available for training. Existing few-shot relation extraction methods focus on uni-modal information such as text only. This reduces performance when the text provides no clear context between the named entities. We propose a multi-modal few-shot relation extraction model (MFS-HVE) that leverages both textual and visual semantic information to learn a multi-modal representation jointly. MFS-HVE includes semantic feature extractors and multi-modal fusion components. The semantic feature extractors extract both textual and visual features; the visual features include global image features and local object features within the image. The multi-modal fusion unit integrates information from the different modalities using image-guided attention, object-guided attention, and hybrid feature attention to fully capture the semantic interaction between visual regions of images and relevant texts. Extensive experiments conducted on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.

Paper Structure

This paper contains 34 sections, 15 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: An example of multi-modal relation extraction based on visual information.
  • Figure 2: Overview of MFS-HVE. Details of the multi-modal fusion are given in the multi-modal fusion section and in Figure 3.
  • Figure 3: Detailed structure of multi-modal fusion.
  • Figure 4: Examples comparing our proposed model MFS-HVE with a text-based model on the MNRE and FewRel datasets. The right column shows the relation extraction results together with the objects detected in the relevant image. Head entities are highlighted in green and tail entities in red.
  • Figure 5: Effect of varying the number of embedded objects in one-shot settings on the MNRE and FewRelsmall datasets.