M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction

Jiang Liu, Bobo Li, Xinran Yang, Na Yang, Hao Fei, Mingyao Zhang, Fei Li, Donghong Ji

TL;DR

M$^{3}$D introduces a document-level multimodal IE benchmark of paired video-text data in English and Chinese, supporting named entity recognition (NER), entity chain extraction, relation extraction (RE), and visual grounding (VG). It pairs a hierarchical model with a Denoised Feature Fusion Module (DFFM) for integrating modalities and a Missing Modality Construction Module (MMCM) for handling missing modalities. The model reaches an average performance across the four tasks of 53.80% on English and 53.77% on Chinese, with gains over baseline methods and ablations confirming the contributions of both modules. The dataset also adds biography as a new domain for multimodal IE, supporting future research in multilingual, video-grounded information extraction.

Abstract

Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level, image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. To promote the development of multimodal IE, we construct a multimodal, multilingual, multitask dataset named M$^{3}$D, which has the following features: (1) It contains paired document-level text and video to enrich multimodal information; (2) It supports two widely used languages, namely English and Chinese; (3) It covers four multimodal IE tasks: entity recognition, entity chain extraction, relation extraction, and visual grounding. In addition, our dataset introduces an unexplored theme, biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose a hierarchical multimodal IE model. This model leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete, so we design a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieves an average performance of 53.80% and 53.77% on the four tasks in the English and Chinese datasets, respectively, which sets a reasonable standard for subsequent research. We also conduct further analytical experiments to verify the effectiveness of the proposed modules. We believe that our work can promote the development of the field of multimodal IE.
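To make the four tasks named in the abstract concrete, the following is a minimal sketch of how one document-level sample could be represented. It is not the dataset's actual schema: every class name, field name, label (`PER`, `LOC`, `born_in`), the example sentence, and the box coordinates are hypothetical and chosen only for illustration. The sketch also assumes that relations hold between entity chains rather than between individual mentions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Mention:
    start: int   # character offset into the document text (inclusive)
    end: int     # character offset (exclusive)
    label: str   # entity type; the label set here is hypothetical


@dataclass
class Box:
    frame: int                               # index of a sampled video frame
    xyxy: Tuple[float, float, float, float]  # pixel corners (x1, y1, x2, y2)


@dataclass
class EntityChain:
    mentions: List[Mention]                         # coreferent mentions of one entity
    boxes: List[Box] = field(default_factory=list)  # visual grounding, may be empty


@dataclass
class Relation:
    head: int    # index of the head chain in Sample.chains (assumed chain-level)
    tail: int    # index of the tail chain in Sample.chains
    label: str   # relation type; hypothetical


@dataclass
class Sample:
    text: str
    language: str
    chains: List[EntityChain]
    relations: List[Relation]

    def entities(self) -> List[Tuple[str, str]]:
        """Flatten chains into (surface form, label) pairs: the NER view."""
        return [(self.text[m.start:m.end], m.label)
                for c in self.chains for m in c.mentions]


def find(text: str, surface: str, label: str) -> Mention:
    """Locate the first occurrence of `surface` and wrap it as a Mention."""
    start = text.index(surface)
    return Mention(start, start + len(surface), label)


def build_example() -> Sample:
    """Build one made-up biography sample covering all four task outputs."""
    text = "Marie Curie was born in Warsaw. She won the Nobel Prize in 1903."
    person = EntityChain(
        mentions=[find(text, "Marie Curie", "PER"), find(text, "She", "PER")],
        boxes=[Box(frame=0, xyxy=(40.0, 30.0, 200.0, 310.0))],
    )
    city = EntityChain(mentions=[find(text, "Warsaw", "LOC")])
    return Sample(text, "en", [person, city], [Relation(0, 1, "born_in")])
```

In this layout, entity recognition corresponds to the flattened mentions, entity chain extraction to the grouping of mentions into chains, relation extraction to the `relations` list, and visual grounding to the boxes attached to a chain. Calling `build_example().entities()` returns the three mentions with their labels.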

Paper Structure

This paper contains 32 sections, 24 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: A sample in the M$^{3}$D dataset. The left part is the input example, and the right part is the output example of four tasks.
  • Figure 2: The overall construction process of the M$^{3}$D dataset. Step 1: Crawl videos from video platforms. Step 2: Split each video into clips. Step 3: Sample images from the video clips. Step 4: Generate subtitles from the video clips using a subtitle generation model. Step 5: Develop annotation guidelines. Step 6: Three annotators annotate the image and text data following the guidelines. (NER: named entity recognition, RE: relation extraction, CR: coreference resolution, VG: visual grounding)
  • Figure 3: Quantity statistics for each relation type.
  • Figure 4: Some examples of visual grounding annotations. The red entity in the text corresponds to the visual target in the image.
  • Figure 5: The overall architecture of our model. The dashed line indicates execution when the modality is missing. For a detailed introduction to DFFM and MMCM, see their respective sections. $\bigoplus$ represents the element-wise summation operation.
  • ...and 4 more figures
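
Steps 2 and 3 of the construction process in Figure 2 (splitting a video into clips, then sampling images from each clip) can be sketched as below. The caption does not say how clips are cut or how frames are chosen, so this sketch assumes fixed-length clips and uniform sampling at segment midpoints. Both choices, and the function names `split_into_clips` and `sample_timestamps`, are hypothetical.

```python
from typing import List, Tuple

Clip = Tuple[float, float]  # (start, end) in seconds


def split_into_clips(duration: float, clip_len: float) -> List[Clip]:
    """Cut [0, duration) into consecutive clips of at most clip_len seconds.

    Assumption: fixed-length clips; the last clip may be shorter.
    """
    if duration <= 0 or clip_len <= 0:
        raise ValueError("duration and clip_len must be positive")
    clips: List[Clip] = []
    start = 0.0
    while start < duration:
        end = min(start + clip_len, duration)
        clips.append((start, end))
        start = end
    return clips


def sample_timestamps(clip: Clip, n: int) -> List[float]:
    """Return n timestamps at the midpoints of n equal segments of the clip.

    Assumption: uniform sampling; a decoder would then grab the frame
    closest to each timestamp.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    start, end = clip
    step = (end - start) / n
    return [start + (i + 0.5) * step for i in range(n)]


if __name__ == "__main__":
    for clip in split_into_clips(duration=25.0, clip_len=10.0):
        print(clip, sample_timestamps(clip, 2))
```

Midpoint sampling avoids picking the first or last frame of a clip, which often falls on a transition. A scene-change detector would be a reasonable alternative to fixed-length cuts.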