Text Role Classification in Scientific Charts Using Multimodal Transformers
Hye Jin Kim, Nicolas Lell, Ansgar Scherp
TL;DR
The paper tackles text role classification in scientific charts by finetuning two pretrained multimodal document-layout models, LayoutLMv3 and UDOP, on chart datasets. It systematically examines data augmentation and balancing to boost performance and assesses robustness to noise (ICPR22-N) and generalization to CHIME-R, DeGruyter, and EconBiz. LayoutLMv3 consistently outperforms UDOP, achieving a peak F1-macro of 82.87 on ICPR22 when trained on ICPR22 alone, and shows stronger generalization, while UDOP benefits more from training on multiple datasets. The study demonstrates that off-the-shelf document-analysis models can be adapted to chart text-role classification, offering practical insights for improving chart readability and supporting automated chart analysis tools.
Abstract
Text role classification is the task of classifying the semantic role of textual elements within scientific charts. For this task, we propose to finetune two pretrained multimodal document layout analysis models, LayoutLMv3 and UDOP, on chart datasets. Both transformers take three modalities as input: text, image, and layout. We further investigate whether data augmentation and balancing methods improve the performance of the models. The models are evaluated on various chart datasets, and the results show that LayoutLMv3 outperforms UDOP in all experiments. LayoutLMv3 achieves the highest F1-macro score of 82.87 on the ICPR22 test dataset, beating the best-performing model from the ICPR22 CHART-Infographics challenge. Moreover, the robustness of the models is tested on a synthetic noisy dataset, ICPR22-N. Finally, the generalizability of the models is evaluated on three chart datasets, CHIME-R, DeGruyter, and EconBiz, for which we added labels for the text roles. The findings indicate that, even when training data is limited, transformers can be used with the help of data augmentation and balancing methods. The source code and datasets are available on GitHub at https://github.com/hjkimk/text-role-classification
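
Because results are reported as F1-macro, it may help to recall how that metric treats rare text roles. The following is a minimal sketch of the standard macro-averaged F1 definition, not the authors' evaluation code; the function name `macro_f1` and the role labels in the example are illustrative assumptions and do not necessarily match the label sets of the datasets.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical role labels, for illustration only.
y_true = ["tick_label"] * 4 + ["axis_title", "legend_label"]
y_pred = ["tick_label"] * 4 + ["axis_title", "axis_title"]

score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}  macro-F1={score:.3f}")
```

In this toy example five of six predictions are correct, yet the single missed `legend_label` contributes a per-class F1 of zero and pulls the macro average well below the accuracy. Since every class carries equal weight regardless of its frequency, a macro-averaged score is sensitive to performance on rare roles, which is one plausible reason balancing methods are of interest when it is the reported metric.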
