See or Recall: A Sanity Check for the Role of Vision in Solving Visualization Question Answer Tasks with Multimodal LLMs

Zhimin Li, Haichao Miao, Xinyuan Yan, Valerio Pascucci, Matthew Berger, Shusen Liu

TL;DR

The paper investigates whether multimodal LLMs truly interpret visualizations in visualization QA tasks or rely on factual recall from training data. It introduces a sanity-check framework combining a rule-based decision tree and a sanity-check table to disentangle seeing from recalling, and validates it with four diverse VisQA datasets (VLAT, VLATForge, VILA, ChartQA). Through factual vs non-factual data generation and information-pathway ablations, the study demonstrates substantial recall-driven performance on several benchmarks and identifies cases where context or visual input either helps or harms answers. It also shows that prompt design alone often does not fix recall biases, and it offers mitigation strategies to improve the reliability and validity of visualization understanding evaluations. Overall, the work cautions against overestimating MLLMs’ visualization reasoning and argues for more rigorous, bias-aware evaluation practices across multimodal tasks.

Abstract

Recent developments in multimodal large language models (MLLMs) have equipped language models to reason about vision and language jointly. This permits MLLMs to both perceive and answer questions about data visualizations across a variety of designs and tasks. Applying MLLMs to a broad range of visualization tasks requires us to properly evaluate their capabilities, and the most common way to conduct evaluation is through measuring a model's visualization reasoning capability, analogous to how we would evaluate human understanding of visualizations (e.g., visualization literacy). However, we found that in the context of visualization question answering (VisQA), how an MLLM perceives and reasons about visualizations can be fundamentally different from how humans approach the same problem. During evaluation, even without the visualization, the model could correctly answer a substantial portion of the visualization test questions, regardless of whether any selection options were provided. We hypothesize that the vast amount of knowledge encoded in the language model permits factual recall that supersedes the need to seek information from the visual signal. This raises concerns that the current VisQA evaluation may not fully capture the models' visualization reasoning capabilities. To address this, we propose a comprehensive sanity check framework that integrates a rule-based decision tree and a sanity check table to disentangle the effects of "seeing" (visual processing) and "recall" (reliance on prior knowledge). The framework validates VisQA datasets for evaluation, highlighting where models are truly "seeing", positively or negatively affected by factual recall, or relying on inductive biases for question answering. Our study underscores the need for careful consideration in designing future visualization understanding studies when utilizing MLLMs.

Paper Structure

This paper contains 23 sections, 14 figures, 7 tables.

Figures (14)

  • Figure 1: An example from the VLAT dataset where the highlighted text "oil price in 2015" provides context that can trigger factual recall in the MLLM, allowing the model to answer correctly even without visual input.
  • Figure 2: The performance of the GPT-4o model with and without the visualization at query time on the VLAT, VILA, and ChartQA datasets. For each dataset, we display the ratio of questions answered correctly or incorrectly in four scenarios. The ratios of questions answered correctly in both settings (with and without visualization) and answered correctly only without the visualization are high, indicating substantial recall in the VLAT and VILA evaluations. In ChartQA, a large portion of questions can only be answered correctly with the visualization, revealing a relatively low recall rate.
  • Figure 3: Given a visualization and a question, the decision pathways of humans and MLLMs can differ. Humans often have limited recall knowledge about the question and rely on interpreting the visualization to answer it. In contrast, the MLLM can obtain the correct answer without relying on the visualization.
  • Figure 4: A decision-tree-based workflow to identify problematic cases. The decision tree is constructed by moving from no information (no recall/no see) to full information (recall/see) during the visualization reasoning evaluation. The resulting tree has six leaf nodes: four fail the sanity check and two pass it. Together, these are summarized into three cases for validating the model evaluation.
  • Figure 5: Four experimental setups for the four entries of the sanity check table. As in the sanity check table, in the 2x2 image grid the rows encode whether the context is provided, whereas the columns specify whether the visualization is used for evaluation.
  • ...and 9 more figures
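
The four-scenario breakdown described in the Figure 2 caption can be illustrated with a short tally. This is only a minimal sketch, not the authors' evaluation code: the function name `outcome_ratios`, the category labels, and the toy correctness flags are assumptions made for illustration. It assumes each question has already been graded twice, once with the visualization and once without.

```python
from collections import Counter

# Hypothetical labels for the four with/without-visualization scenarios.
LABELS = {
    (True, True): "correct_both",
    (True, False): "correct_only_with_vis",
    (False, True): "correct_only_without_vis",
    (False, False): "incorrect_both",
}


def outcome_ratios(correct_with_vis, correct_without_vis):
    """Return the share of questions falling into each of the four scenarios.

    Both arguments are per-question correctness flags, aligned by index.
    """
    if len(correct_with_vis) != len(correct_without_vis):
        raise ValueError("both lists must cover the same questions")
    if not correct_with_vis:
        raise ValueError("need at least one question")

    counts = Counter(
        LABELS[(bool(w), bool(wo))]
        for w, wo in zip(correct_with_vis, correct_without_vis)
    )
    total = len(correct_with_vis)
    # Counter returns 0 for scenarios that never occur.
    return {name: counts[name] / total for name in LABELS.values()}


if __name__ == "__main__":
    # Toy flags, made up for illustration only (not results from the paper).
    with_vis = [True, True, True, False]
    without_vis = [True, True, False, False]
    for name, ratio in outcome_ratios(with_vis, without_vis).items():
        print(f"{name}: {ratio:.2f}")
```

Under this reading, a large combined share of `correct_both` and `correct_only_without_vis` would point to recall, while a large `correct_only_with_vis` share would suggest the model depends on the visual input.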