Protecting multimodal large language models against misleading visualizations

Jonathan Tonglet, Tinne Tuytelaars, Marie-Francine Moens, Iryna Gurevych

TL;DR

The paper assesses the robustness of multimodal large language models (MLLMs) to misleading visualizations by evaluating 19 models across three compound datasets and a ChartQA benchmark, including a real-world subset of misleading visualizations. It shows that QA accuracy on misleading visualizations can fall to near the random baseline, and it systematically compares six inference-time correction methods. Two methods, table-based QA and redrawing the visualization, provide the largest improvements (up to 19.6 percentage points) but can incur costs on non-misleading visualizations, highlighting trade-offs between robustness and fidelity. The work offers new datasets, code, and insights into the role of parametric knowledge and misleader types, underscoring the need for robust mitigation when deploying chart-understanding systems.

Abstract

Visualizations play a pivotal role in daily communication in an increasingly data-driven world. Research on multimodal large language models (MLLMs) for automated chart understanding has accelerated massively, with steady improvements on standard benchmarks. However, for MLLMs to be reliable, they must be robust to misleading visualizations, i.e., charts that distort the underlying data, leading readers to draw inaccurate conclusions. Here, we uncover an important vulnerability: MLLM question-answering (QA) accuracy on misleading visualizations drops on average to the level of the random baseline. To address this, we provide the first comparison of six inference-time methods to improve QA performance on misleading visualizations, without compromising accuracy on non-misleading ones. We find that two methods, table-based QA and redrawing the visualization, are effective, with improvements of up to 19.6 percentage points. We make our code and data available.
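The two quantities the abstract relies on, accuracy relative to a random baseline and improvement in percentage points, can be made concrete with a small sketch. It is illustrative only and is not taken from the authors' code: the function names, the assumption that the random baseline is uniform guessing over each question's answer options, and the toy data are all hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Percentage of questions whose predicted option matches the gold answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def random_baseline(num_options: Sequence[int]) -> float:
    """Expected accuracy (%) of guessing uniformly at random.

    Assumption: each question is multiple choice with k options, so a
    random guess is right with probability 1/k; the baseline is the mean
    of 1/k over all questions.
    """
    if not num_options:
        raise ValueError("need at least one question")
    return 100.0 * sum(1.0 / k for k in num_options) / len(num_options)


def gain_in_points(before: float, after: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return after - before


# Toy data (hypothetical, not results from the paper).
gold = ["A", "B", "D", "D"]
predicted = ["A", "B", "C", "D"]
options_per_question = [4, 4, 2, 2]

acc = accuracy(predicted, gold)
baseline = random_baseline(options_per_question)
print(f"accuracy: {acc:.1f}%, random baseline: {baseline:.1f}%")
print(f"gain over baseline: {gain_in_points(baseline, acc):.1f} points")
```

A percentage-point gain is an absolute difference between two accuracies, so an improvement of 19.6 points means the accuracy figure itself rises by 19.6, not that it grows by 19.6% of its previous value.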

Paper Structure

This paper contains 27 sections, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Three examples of real-world misleading visualizations [lo2022misinformed] with MCQs. The correct answer is colored in green, while the wrong answer supported by the misleader is colored in purple.
  • Figure 2: Non-misleading and misleading visualizations of the same data [10.1145/3380851.3416762] with a Likert-scale question where 1 means "a little" and 6 means "a lot". A consistent interpretation requires identical responses. However, the deceived reader chooses a higher value for the truncated bar chart.
  • Figure 3: Illustration of the six inference-time correction methods applied to a misleading visualization from CALVI [10.1145/3544548.3581406]. The visualization suffers from inconsistent tick intervals on the y-axis.
  • Figure 4: Top: Accuracy (%) of various MLLMs on misleading visualization, non-misleading visualization, and ChartQA datasets. The horizontal dashed line indicates the accuracy of the random baseline on misleading visualizations. Models are sorted by increasing accuracy on ChartQA. Bottom: Accuracy (%) of various MLLMs on subsets of the misleading visualizations. The horizontal dashed lines indicate average human accuracy on CALVI and CHARTOM [10.1145/3544548.3581406; rho2023various; bharti2024chartom].
  • Figure 5: Average Likert-scale ratings (1 to 6) on four pairs of misleading and non-misleading visualizations depicting the same data. MLLM results are reported with standard deviations. Average human results are reported from [10.1145/3380851.3416762].
  • ...and 10 more figures