DongbaMIE: A Multimodal Information Extraction Dataset for Evaluating Semantic Understanding of Dongba Pictograms

Xiaojun Bi, Shuo Li, Junyao Xing, Ziyue Wang, Fuwen Luo, Weizheng Qiao, Lu Han, Ziwei Sun, Peng Li, Yang Liu

TL;DR

DongbaMIE introduces the first multimodal information extraction dataset for Dongba pictographs, pairing high-resolution images with Chinese translations across four semantic dimensions (Object, Action, Relation, Attribute) and enabling evaluation under zero-shot, few-shot, and supervised fine-tuning regimes. The authors implement a hybrid annotation pipeline and validate high inter-annotator agreement, providing a robust resource for endangered-script analysis. Across multiple MLLMs, results reveal substantial gaps in zero-shot/few-shot performance and mixed gains from supervised fine-tuning, with complex relations and attributes being particularly challenging and visual feature learning proving crucial. The work advances cultural heritage preservation by establishing a benchmark and outlining directions to improve multimodal understanding of Dongba pictographs, including dataset expansion and enhanced visual representations.

Abstract

Dongba is the only pictographic script still in use in the world. Its pictorial, ideographic features carry rich cultural and contextual information. However, owing to the lack of relevant datasets, research on the semantic understanding of Dongba hieroglyphs has progressed slowly. To this end, we constructed DongbaMIE, the first dataset focused on multimodal information extraction from Dongba pictographs. The dataset consists of images of Dongba hieroglyphic characters and their corresponding semantic annotations in Chinese. It contains 23,530 sentence-level and 2,539 paragraph-level high-quality text-image pairs. The annotations cover four semantic dimensions: object, action, relation, and attribute. A systematic evaluation of mainstream multimodal large language models shows that the models struggle to extract information from Dongba hieroglyphs under zero-shot and few-shot learning. Although supervised fine-tuning improves performance, accurate extraction of complex semantics remains a major challenge.
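To make the annotation layout concrete, the following is a minimal sketch of how one image-text pair with its four semantic dimensions might be represented and serialized. The class name, field names, file path, and example labels are all illustrative assumptions; the abstract does not specify the dataset's actual schema or file format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DongbaAnnotation:
    """Hypothetical record layout; field names are assumptions, not the dataset's schema."""
    image_path: str        # path to the Dongba pictograph image
    translation_zh: str    # Chinese translation paired with the image
    level: str             # "sentence" or "paragraph"
    objects: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    relations: list[str] = field(default_factory=list)
    attributes: list[str] = field(default_factory=list)


def to_json(record: DongbaAnnotation) -> str:
    # ensure_ascii=False keeps the Chinese labels readable in the output
    return json.dumps(asdict(record), ensure_ascii=False)


# Placeholder values for illustration only; not taken from the dataset.
example = DongbaAnnotation(
    image_path="images/example.png",
    translation_zh="人骑马",
    level="sentence",
    objects=["人", "马"],
    actions=["骑"],
)
print(to_json(example))
```

Keeping each dimension as its own list lets a record leave a dimension empty when a sentence has no relation or attribute labels, which is one plausible way to handle sparse annotations.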

Paper Structure

This paper contains 22 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: An example of Dongba pictographs from the DongbaMIE dataset.
  • Figure 2: The entire process of establishing the DongbaMIE dataset.
  • Figure 3: The image presents a semantic visualization yielded by the information extraction framework. Above this visualization, sentence-level Dongba pictographs are shown with their Chinese translations. English descriptions are provided solely for understanding; all annotations are originally in Chinese.
  • Figure 4: The image displays a Dongba pictograph with manually added annotations highlighting Objects (blue) and Actions (yellow). The text below shows multimodal semantic extraction results from six models, alongside ground-truth labels. Predictions and ground truth are shown in Chinese, with English notes for clarity.
  • Figure 5: The DeepSeek v3 IE prompt template, which automates multi-label semantic analysis of Dongba pictograms and outputs actions, objects, relations, and attributes as structured JSON.
  • ...and 3 more figures