AltChart: Enhancing VLM-based Chart Summarization Through Multi-Pretext Tasks

Omar Moured, Jiaming Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen

TL;DR

AltChart addresses chart accessibility for blind and visually impaired users in two ways: it introduces a real-chart dataset with long, semantically rich alt-text, and it pretrains vision-language models with multiple pretext tasks so that they learn fine-grained chart representations. The AltChart dataset contains 10,000 real chart images spanning eight chart types and ten semantic attributes, paired with human-authored, accessibility-compliant summaries. The proposed multi-pretext pretraining framework yields an improvement of about $2.5\%$ and achieves state-of-the-art performance on the VisText, Chart-2-Text, and AltChart benchmarks while using a relatively compact model of around 180M parameters. The authors release the dataset and code publicly, discuss limitations such as residual hallucinations on complex charts, and outline future directions such as exploring more encoder architectures and end-to-end training.

Abstract

Chart summarization is a crucial task for blind and visually impaired individuals as it is their primary means of accessing and interpreting graphical data. Crafting high-quality descriptions is challenging because it requires precise communication of essential details within the chart without visual perception. Many chart analysis methods, however, produce brief, unstructured responses that may contain significant hallucinations, affecting their reliability for blind people. To address these challenges, this work presents three key contributions: (1) We introduce the AltChart dataset, comprising 10,000 real chart images, each paired with a comprehensive summary that features long-context, semantically rich annotations. (2) We propose a new method for pretraining Vision-Language Models (VLMs) to learn fine-grained chart representations through training with multiple pretext tasks, yielding a performance gain of ${\sim}2.5\%$. (3) We conduct extensive evaluations of four leading chart summarization models, analyzing how accessible their descriptions are. Our dataset and code are publicly available on our project page: https://github.com/moured/AltChart.

Paper Structure

This paper contains 29 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Two chart samples from AltChart with their annotated summaries. Semantics are indicated by a color code, where <semantic-name> marks the beginning and </semantic-name> marks the end of the semantic segment.
  • Figure 2: Overview of our vision encoder's training approach, starting from the top-left with tasks including puzzle solving, colorization, rotation, and classification. Sample outputs for each corresponding task are shown on the bottom-right of the figure.
  • Figure 3: Qualitative analysis of chart summarization.
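
The Figure 1 caption describes summaries in which each semantic segment is wrapped in a <semantic-name> ... </semantic-name> pair. As a minimal sketch of how such an annotation might be read, the Python snippet below pulls out the tagged segments and also recovers the plain alt-text. The tag names used in the example ("chart-type", "trend") and the function names are hypothetical, chosen only for illustration; the actual attribute names and file layout of the AltChart release are not given here and may differ. Nested tags are not handled.

```python
import re

# Matches <name>...</name>, where the closing tag repeats the opening name.
# Assumption: tag names are letters, digits, underscores, or hyphens.
SEGMENT_RE = re.compile(r"<([A-Za-z][\w-]*)>(.*?)</\1>", re.DOTALL)
TAG_RE = re.compile(r"</?[A-Za-z][\w-]*>")


def extract_segments(summary):
    """Return (semantic_name, text) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in SEGMENT_RE.finditer(summary)]


def strip_tags(summary):
    """Drop the semantic tags and return the plain alt-text."""
    text = TAG_RE.sub("", summary)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical annotated summary; not taken from the dataset.
example = (
    "<chart-type>This is a bar chart.</chart-type> "
    "<trend>Sales rise steadily from 2019 to 2021.</trend>"
)

segments = extract_segments(example)
plain = strip_tags(example)
```

Keeping the segment list separate from the plain text makes it possible to check which semantic attributes a summary covers without changing the text that a screen reader would receive.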