Text Role Classification in Scientific Charts Using Multimodal Transformers

Hye Jin Kim; Nicolas Lell; Ansgar Scherp

TL;DR

The paper tackles text role classification in scientific charts by finetuning two pretrained multimodal document-layout models, LayoutLMv3 and UDOP, on chart datasets. It systematically examines data augmentation and balancing to boost performance and assesses robustness to noise (ICPR22-N) and generalization to CHIME-R, DeGruyter, and EconBiz. LayoutLMv3 consistently outperforms UDOP, achieving a peak F1-macro of $82.87\%$ on ICPR22 when trained on ICPR22 alone, and shows stronger generalization, while UDOP benefits more from training on multiple datasets. The study demonstrates that off-the-shelf document-analysis models can be adapted to chart text-role classification, offering practical insights for improving chart readability and supporting automated chart analysis tools.

Abstract

Text role classification involves classifying the semantic role of textual elements within scientific charts. For this task, we propose to finetune two pretrained multimodal document layout analysis models, LayoutLMv3 and UDOP, on chart datasets. The transformers take three modalities as input: text, image, and layout. We further investigate whether data augmentation and balancing methods improve the performance of the models. The models are evaluated on various chart datasets, and results show that LayoutLMv3 outperforms UDOP in all experiments. LayoutLMv3 achieves the highest F1-macro score of 82.87 on the ICPR22 test dataset, beating the best-performing model from the ICPR22 CHART-Infographics challenge. Moreover, the robustness of the models is tested on a synthetic noisy dataset, ICPR22-N. Finally, the generalizability of the models is evaluated on three chart datasets, CHIME-R, DeGruyter, and EconBiz, for which we added labels for the text roles. Findings indicate that even when training data is limited, transformers can be used with the help of data augmentation and balancing methods. The source code and datasets are available on GitHub under https://github.com/hjkimk/text-role-classification
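The F1-macro score reported above is the unweighted mean of per-class F1 scores, so rare text roles count as much as frequent ones. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `f1_macro` and the role names in the usage note are illustrative assumptions.

```python
from collections import Counter


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one sample is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

As a usage illustration with hypothetical labels: if eight of ten elements are `"tick_label"` and two are `"legend_title"`, a classifier that always predicts `"tick_label"` reaches 80% accuracy, yet `f1_macro` averages a per-class F1 of 8/9 with one of 0. This sensitivity to minority classes is one reason a macro-averaged score pairs naturally with the data balancing methods studied in the paper.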

Paper Structure (29 sections, 4 figures, 8 tables)

Introduction
Related Work
Unimodal Transformers
Multimodal Models
Text Role Classification
Methods
Models
LayoutLMv3
UDOP
Data Augmentation and Balancing
Data Augmentation
Data Balancing
Experimental Apparatus
Datasets
ICPR22
...and 14 more sections

Figures (4)

Figure 1: A sample bar chart from ICPR22. Along with the chart image and the text, the text bounding box coordinates are used as the position modality for the multimodal input to the transformers.
Figure 2: Demonstration of cutout augmentation applied to a bar chart from ICPR22. In this example, the chart is augmented with 10 masks for the tick label class.
Figure 3: Example charts from each dataset
Figure 4: Example case where deleting characters from a text element resulted in character exclusions from the bounding box. Upon deleting "Fre" from the text element "French controls from general population" and adjusting the bounding box, "gen" in the following line is also excluded from the bounding box.
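
Figure 1 notes that text bounding box coordinates serve as the position modality. In the Hugging Face implementations of LayoutLM-family models, boxes are expected as integers on a 0 to 1000 scale relative to the image size. The sketch below shows that normalization under the assumption that boxes arrive as pixel-space `(x0, y0, x1, y1)` tuples; the helper name `normalize_box` is hypothetical, and whether the authors' pipeline preprocesses boxes in exactly this way is not stated in this section.

```python
def normalize_box(box, width, height, scale=1000):
    """Map a pixel-space (x0, y0, x1, y1) box to integer coordinates in [0, scale]."""
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")

    x0, y0, x1, y1 = box
    # Order the corners so that (x0, y0) is top-left and (x1, y1) is bottom-right.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))

    def clamp(value):
        # Keep coordinates inside the valid range even if the box overshoots the image.
        return max(0, min(scale, value))

    return [
        clamp(int(scale * x0 / width)),
        clamp(int(scale * y0 / height)),
        clamp(int(scale * x1 / width)),
        clamp(int(scale * y1 / height)),
    ]
```

For example, on a 200 by 100 pixel chart, `normalize_box((50, 20, 150, 40), 200, 100)` maps the box to the resolution-independent coordinates `[250, 200, 750, 400]`, so charts of different sizes share one coordinate space.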
