Text Role Classification in Scientific Charts Using Multimodal Transformers

Hye Jin Kim, Nicolas Lell, Ansgar Scherp

TL;DR

The paper tackles text role classification in scientific charts by finetuning two pretrained multimodal document-layout models, LayoutLMv3 and UDOP, on chart datasets. It systematically examines data augmentation and balancing to boost performance and assesses robustness to noise (ICPR22-N) and generalization to CHIME-R, DeGruyter, and EconBiz. LayoutLMv3 consistently outperforms UDOP, achieving a peak F1-macro of $82.87\%$ on ICPR22 when trained on ICPR22 alone, and shows stronger generalization, while UDOP benefits more from training on multiple datasets. The study demonstrates that off-the-shelf document-analysis models can be adapted to chart text-role classification, offering practical insights for improving chart readability and supporting automated chart analysis tools.

Abstract

Text role classification involves classifying the semantic role of textual elements within scientific charts. For this task, we propose to finetune two pretrained multimodal document layout analysis models, LayoutLMv3 and UDOP, on chart datasets. The transformers take three modalities as input: text, image, and layout. We further investigate whether data augmentation and balancing methods improve the models' performance. The models are evaluated on various chart datasets, and the results show that LayoutLMv3 outperforms UDOP in all experiments. LayoutLMv3 achieves the highest F1-macro score of 82.87 on the ICPR22 test dataset, beating the best-performing model from the ICPR22 CHART-Infographics challenge. Moreover, the robustness of the models is tested on a synthetic noisy dataset, ICPR22-N. Finally, the generalizability of the models is evaluated on three chart datasets, CHIME-R, DeGruyter, and EconBiz, for which we added text role labels. The findings indicate that transformers can be used even when training data is limited, with the help of data augmentation and balancing methods. The source code and datasets are available on GitHub at https://github.com/hjkimk/text-role-classification
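Since the headline result is reported as an F1-macro score, a minimal sketch of how that metric is computed may help in reading the numbers. F1-macro is the unweighted mean of the per-class F1 scores, so rare text roles count as much as frequent ones. The role names used below (`tick_label`, `axis_title`, `legend_label`) are hypothetical placeholders chosen for illustration, not the label set of any of the datasets, and the function is an illustration of the metric rather than the authors' evaluation code.

```python
from collections import Counter


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical role labels, for illustration only.
y_true = ["tick_label", "tick_label", "axis_title", "legend_label"]
y_pred = ["tick_label", "axis_title", "axis_title", "legend_label"]
score = f1_macro(y_true, y_pred)
```

In practice the same value can be obtained with scikit-learn's `f1_score(y_true, y_pred, average="macro")`. Because every class contributes equally to the mean, balancing methods that help minority roles can move this metric noticeably.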

Paper Structure (29 sections, 4 figures, 8 tables)

Figures (4)

  • Figure 1: A sample bar chart from ICPR22. Along with the chart image and the text, the text bounding box coordinates are used as the position modality for the multimodal input to the transformers.
  • Figure 2: Demonstration of cutout augmentation applied to a bar chart from ICPR22. In this example, the chart is augmented with 10 masks for the tick label class.
  • Figure 3: Example charts from each dataset.
  • Figure 4: Example case where deleting characters from a text element resulted in character exclusions from the bounding box. Upon deleting "Fre" from the text element "French controls from general population" and adjusting the bounding box, "gen" in the following line is also excluded from the bounding box.
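
The side effect described in the Figure 4 caption can be reproduced with a small sketch. It assumes a simplified noise procedure that is not confirmed by the text above: characters are deleted from the start of the first line, every character has the same width, lines are left-aligned, and the box is adjusted by moving its left edge to the right. The function names `delete_leading_chars` and `clipped_prefixes`, the line break position, and the box coordinates are all hypothetical.

```python
def delete_leading_chars(text, box, n_delete):
    """Delete n_delete characters from the first line and shrink the box.

    text: text element, lines separated by "\n"
    box:  (x0, y0, x1, y1) enclosing all lines
    Assumes fixed-width characters and left-aligned lines (illustrative only).
    """
    x0, y0, x1, y1 = box
    lines = text.split("\n")
    # Estimate the character width from the longest line in the element.
    char_width = (x1 - x0) / max(len(line) for line in lines)
    new_text = "\n".join([lines[0][n_delete:]] + lines[1:])
    new_box = (x0 + n_delete * char_width, y0, x1, y1)
    return new_text, new_box


def clipped_prefixes(text, box, new_box):
    """Return, per original line, the characters left of the new box edge."""
    x0, _, x1, _ = box
    lines = text.split("\n")
    char_width = (x1 - x0) / max(len(line) for line in lines)
    n_clipped = round((new_box[0] - x0) / char_width)
    return [line[:n_clipped] for line in lines]


# Two-line text element from the Figure 4 caption; the line break position
# and the coordinates are assumed for this sketch.
text = "French controls from\ngeneral population"
box = (0, 0, 200, 40)
new_text, new_box = delete_leading_chars(text, box, 3)
clipped = clipped_prefixes(text, box, new_box)
```

Under these assumptions, moving the left edge past "Fre" on the first line also cuts "gen" from the second line, because a single rectangle covers both lines.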