Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model

Li Yuan, Yi Cai, Junsheng Huang

TL;DR

The paper tackles few-shot joint multimodal entity-relation extraction by introducing KECPM, a two-stage framework that harvests auxiliary background knowledge from a large language model through dynamic, similarity-guided prompts and self-reflection. The Knowledge Ingestion stage produces contextually relevant knowledge, which is then fused with the original input in a Knowledge-enhanced LM to output JMERE quintuples, using a structured input format and a language-model objective. Extensive experiments on FS-JMERE datasets show that KECPM outperforms strong unimodal and multimodal baselines in both micro and macro F1 scores, with ablations validating the benefits of prompt selection and knowledge refinement. The approach offers a practical path to robust JMERE in low-data regimes and suggests broader applicability to other multimodal information extraction tasks.

Abstract

Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task that aims to extract entities and their relations from text-image pairs in social media posts. Existing methods for JMERE require large amounts of labeled data, but gathering and annotating fine-grained multimodal data for JMERE poses significant challenges. We first construct diverse and comprehensive multimodal few-shot datasets fitted to the original data distribution. To address the insufficient information in the few-shot setting, we introduce the Knowledge-Enhanced Cross-modal Prompt Model (KECPM) for JMERE, which guides a large language model to generate supplementary background knowledge. Our proposed method comprises two stages: (1) a knowledge ingestion stage that dynamically formulates prompts based on semantic similarity to guide ChatGPT in generating relevant knowledge and employs self-reflection to refine that knowledge; (2) a knowledge-enhanced language model stage that merges the auxiliary knowledge with the original input and utilizes a transformer-based model to align with JMERE's required output format. We extensively evaluate our approach on a few-shot dataset derived from the JMERE dataset, demonstrating its superiority over strong baselines in terms of both micro and macro F1 scores. Additionally, we present qualitative analyses and case studies to elucidate the effectiveness of our model.
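
The similarity-guided prompt formulation in stage (1) can be illustrated with a small sketch. This is not the authors' implementation: the function names (`bag_of_words`, `cosine`, `select_demonstrations`, `build_prompt`), the prompt wording, and the use of bag-of-words cosine similarity in place of a learned sentence encoder are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; an assumed stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_demonstrations(query, pool, k=2):
    """Return the k pool examples whose text is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        pool,
        key=lambda ex: cosine(q, bag_of_words(ex["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, demos):
    """Assemble a few-shot prompt; the instruction wording is hypothetical."""
    parts = ["Provide background knowledge about the entities in each post."]
    for d in demos:
        parts.append(f"Post: {d['text']}\nKnowledge: {d['knowledge']}")
    parts.append(f"Post: {query}\nKnowledge:")
    return "\n\n".join(parts)
```

Choosing demonstrations per query, rather than fixing one prompt for all inputs, is what makes the prompt "dynamic" in the sense the abstract describes; a real system would likely swap the bag-of-words vectors for dense embeddings.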

Paper Structure

This paper contains 25 sections, 8 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of two different ways of acquiring auxiliary knowledge.
  • Figure 2: An example of self-reflection within ChatGPT to correct its response.
  • Figure 3: The KECPM architecture for few-shot JMERE comprises two stages: (a) The Knowledge Ingestion stage generates auxiliary knowledge from ChatGPT, pertinent to the provided text and image. This enhances the downstream model's contextual comprehension; (b) The Knowledge-enhanced LM combines the auxiliary knowledge with the original input, feeding it into a language model to address the few-shot JMERE task.
  • Figure 4: An instance of self-reflection within ChatGPT, showing that the process can occasionally yield unhelpful responses.
  • Figure 5: The impact of the number of self-reflection iterations (N).
  • ...and 3 more figures
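
The two stages in Figure 3, together with the iteration count N studied in Figure 5, can be sketched as a bounded refinement loop followed by input merging. This is a minimal sketch under stated assumptions, not the paper's code: the critique prompt, the `"OK"` stopping convention, the `|`-separated merged format, the use of an image caption as the visual input, and the helper `scripted_llm` (a canned-reply stand-in for ChatGPT) are all hypothetical.

```python
def scripted_llm(replies):
    """Hypothetical stand-in for a chat model: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


def refine_with_self_reflection(llm, post, draft, max_iters=2):
    """Ask the model to check its own knowledge at most max_iters times.

    The model is assumed to answer 'OK' when satisfied, or a corrected
    version otherwise. Capping the loop limits the damage when reflection
    itself produces an unhelpful revision.
    """
    knowledge = draft
    for _ in range(max_iters):
        reply = llm(
            f"Post: {post}\nKnowledge: {knowledge}\n"
            "Is this knowledge accurate and relevant to the post? "
            "Answer 'OK' or give a corrected version."
        ).strip()
        if reply.upper() == "OK":
            break
        knowledge = reply
    return knowledge


def merge_input(text, image_caption, knowledge):
    """Concatenate the original input with auxiliary knowledge (assumed format)."""
    return f"Text: {text} | Image: {image_caption} | Knowledge: {knowledge}"


if __name__ == "__main__":
    llm = scripted_llm(["Messi is an Argentine footballer.", "OK"])
    knowledge = refine_with_self_reflection(
        llm, "Messi scores again", "Messi is a singer.", max_iters=3
    )
    print(merge_input("Messi scores again", "a footballer celebrating", knowledge))
```

In this sketch `max_iters` plays the role of N: a small cap keeps the number of model calls low and bounds the loop when reflection does not converge.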