CHAOS: Chart Analysis with Outlier Samples

Omar Moured, Yufan Chen, Ruiping Liu, Simon Reiß, Philip Torr, Jiaming Zhang, Rainer Stiefelhagen

TL;DR

CHAOS addresses the robustness of chart understanding in Multimodal Large Language Models (MLLMs) by introducing a perturbation benchmark with 5 textual and 10 visual perturbations, each at three human-derived severity levels. It evaluates 13 state-of-the-art MLLMs on two chart-centric tasks, ChartQA and Chart-to-Text, scored with Relaxed Accuracy for ChartQA and with BLEU-4 and Content Selection for Chart-to-Text, and it proposes a new robustness metric $\mathcal{R}$ that balances relative degradation with absolute drop. Key findings: even minor perturbations can cause measurable degradation; textual perturbations can be as impactful as visual ones; chart-specific models do not automatically achieve higher robustness; and PoT-based strategies can improve resilience. The work provides a public benchmark and analysis framework for robust chart understanding in real-world applications, with implications for accessibility, data analysis, and downstream chart-centric tasks.
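As a rough illustration of the ChartQA metric named above, the following minimal sketch scores answers with Relaxed Accuracy. It assumes the convention commonly used for ChartQA, a 5% relative tolerance on numeric answers and case-insensitive exact match otherwise; the tolerance value, the function names, and the number parsing are assumptions of this sketch, not details taken from the summary.

```python
from typing import Optional, Sequence


def _to_float(text: str) -> Optional[float]:
    """Parse a numeric answer such as '1,200' or '45%'; return None if not numeric."""
    cleaned = text.strip().replace(",", "").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Return True if the prediction counts as correct under relaxed matching.

    Assumption: numeric answers may deviate by up to `tolerance` (relative to
    the target); everything else must match exactly, ignoring case.
    """
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None and target_num != 0:
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that pass relaxed_match against their targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(relaxed_match(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    preds = ["41", "45", "canada", "12%"]
    golds = ["40", "40", "Canada", "12"]
    print(relaxed_accuracy(preds, golds))
```

In this toy run, "41" is accepted for a target of "40" (2.5% off), "45" is rejected (12.5% off), and the two remaining pairs match, so three of four answers count as correct.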

Abstract

Charts play a critical role in data analysis and visualization, yet real-world applications often present charts with challenging or noisy features. Such "outlier charts" pose a substantial challenge even for Multimodal Large Language Models (MLLMs), which can struggle to interpret perturbed charts. In this work, we introduce CHAOS (CHart Analysis with Outlier Samples), a robustness benchmark to systematically evaluate MLLMs against chart perturbations. CHAOS encompasses five types of textual and ten types of visual perturbations, each presented at three levels of severity (easy, mid, hard) informed by the results of a human evaluation study. The benchmark includes 13 state-of-the-art MLLMs divided into three groups (i.e., general-, document-, and chart-specific models) according to their training scope and data. The analysis covers two downstream tasks (ChartQA and Chart-to-Text). Extensive experiments and case studies yield insights into the robustness of models across chart perturbations, aiming to guide future research in the chart understanding domain. Data and code are publicly available at: http://huggingface.co/datasets/omoured/CHAOS.

Paper Structure

This paper contains 30 sections, 18 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Visualization of CHAOS benchmark with 10 types of visual perturbations (VPs) and 5 types of textual perturbations (TPs).
  • Figure 2: Distribution of human study results across perturbation types (y-axis) and levels (x-axis). Each cell shows the number of participants who answered correctly. Symbols {≋, $\approx$, $\sim$} in the cell mean {hard, middle, easy} levels for each perturbation (each row).
  • Figure 3: Visualization of the metric ${\mathcal{R}}$ across perturbed and clean accuracy. All models on the same 'contour' have the same ${\mathcal{R}}$ score. For the same absolute drop (clean${\rightarrow}$perturbed), the model with a lower clean accuracy has a lower robustness. E.g., $\mathcal{R}_{a}{>}\mathcal{R}_{b}{=}\mathcal{R}_{c}$, when $a{=}(0.7, 0.8), b{=}(0.5, 0.6), c{=}(0.33, 0.4)$.
  • Figure 4: Robustness analysis. Clean accuracy is represented by circle size and robustness by color intensity, with lighter colors indicating higher robustness.
  • Figure 5: Study design. Participants start at (a) the highest perturbation level (Level 10) for each chart. (b) Once a participant confirms that the level is interpretable, a corresponding question is posed.
  • ...and 5 more figures
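
The example in the Figure 3 caption can be checked with simple arithmetic. The sketch below assumes each point is ordered as (perturbed accuracy, clean accuracy) and computes only the absolute and relative drops; it does not implement ${\mathcal{R}}$ itself, whose formula is not given in this summary. The helper name `drops` is hypothetical.

```python
from typing import Dict, Tuple


def drops(perturbed: float, clean: float) -> Tuple[float, float]:
    """Return (absolute drop, relative drop) from clean to perturbed accuracy."""
    absolute = clean - perturbed
    relative = absolute / clean
    return absolute, relative


# Points from the Figure 3 caption, assumed to be (perturbed, clean) pairs.
points: Dict[str, Tuple[float, float]] = {
    "a": (0.7, 0.8),
    "b": (0.5, 0.6),
    "c": (0.33, 0.4),
}

for name, (perturbed, clean) in points.items():
    abs_drop, rel_drop = drops(perturbed, clean)
    print(f"{name}: absolute drop = {abs_drop:.3f}, relative drop = {rel_drop:.3f}")
# a: absolute drop = 0.100, relative drop = 0.125
# b: absolute drop = 0.100, relative drop = 0.167
# c: absolute drop = 0.070, relative drop = 0.175
```

Points a and b lose the same 0.1 in absolute terms, but b loses a larger share of its clean accuracy, which matches the caption's $\mathcal{R}_{a}{>}\mathcal{R}_{b}$. Point c has a smaller absolute drop than b but a slightly larger relative drop, which is consistent with the two landing on the same contour under a metric that weighs both quantities.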