Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends

Mirna Al-Shetairy, Hanan Hindy, Dina Khattab, Mostafa M. Aref

TL;DR

Transformer-based frameworks have significantly advanced chart understanding, but challenges remain in OCR dependency, handling low-resolution images, and enhancing visual reasoning.

Abstract

In recent years, interest in vision-language tasks has grown, especially those involving chart interactions. These tasks are inherently multimodal, requiring models to process chart images, accompanying text, underlying data tables, and often user queries. Traditionally, Chart Understanding (CU) relied on heuristics and rule-based systems. However, recent advances that integrate transformer architectures have significantly improved performance. This paper reviews prominent research in CU, focusing on State-of-The-Art (SoTA) frameworks that employ transformers within End-to-End (E2E) solutions. Relevant benchmarking datasets and evaluation techniques are analyzed. Additionally, this article identifies key challenges and outlines promising future directions for advancing CU solutions. Following the PRISMA guidelines, a comprehensive literature search is conducted on Google Scholar, focusing on publications from January 2020 to June 2024. After rigorous screening and quality assessment, 32 studies are selected for in-depth analysis. The CU tasks are categorized into a three-layered paradigm based on the cognitive task required. Recent advancements in the frameworks addressing various CU tasks are also reviewed. Frameworks are categorized as single-task or multi-task based on the number of tasks solvable by the E2E solution. Within multi-task frameworks, pre-trained and prompt-engineering-based techniques are explored. This review surveys leading architectures, datasets, and pre-training tasks. Despite significant progress, challenges remain in OCR dependency, handling low-resolution images, and enhancing visual reasoning. Future directions include addressing these challenges, developing robust benchmarks, and optimizing model efficiency. Additionally, integrating explainable AI techniques and exploring the balance between real and synthetic data are crucial for advancing CU research.

Paper Structure

This paper contains 53 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Statistics on recent research trends obtained through Scopus. (a) Peer-reviewed articles addressing various CU tasks such as CQA, Chart Summarization, and Reasoning over Charts. (b) Frequency of different keywords (e.g., Transformers, BERT, and Self-attention) appearing in peer-reviewed computer science articles over the past years. (c) Use of different deep learning architectures in articles, namely Transformers, CNNs, and GNNs.
  • Figure 2: An overview of the different modalities in the CU domain. Transformer architectures could either work on a single modality or across multiple ones as input. Each transformer block could then address one of the possible output modalities based on the reviewed body of literature, i.e., addressing different output modalities through a combination of different transformer blocks.
  • Figure 3: Multidisciplinary Nature of the Chart Understanding Domain.
  • Figure 4: The Transformer Model Architecture (Vaswani et al., 2017).
  • Figure 5: The ViT Model Architecture (Dosovitskiy et al., 2020).
  • ...and 4 more figures

Theorems & Definitions (4)

  • Definition 1: Chart Images
  • Definition 2: Optical Character Recognition (OCR)
  • Definition 3: Dynamic Encoding
  • Definition 4: Pre-training