Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts

Seon Gyeom Kim, Jae Young Choi, Ryan Rossi, Eunyee Koh, Tak Yeon Lee

TL;DR

This work addresses the problem of predicting the experiential impact of data visualizations by introducing Chart-to-Experience, a benchmark of 36 charts rated by crowdsourced participants on seven experiential factors. It evaluates three state-of-the-art MLLMs on two tasks, direct score prediction and pairwise chart comparison, and finds that the models show low sensitivity when predicting absolute scores but perform strongly on comparisons, especially when the difference in human scores is large. The study highlights biases and limited alignment with human judgments, underscoring that MLLMs are better suited to relative judgments than to precise scoring in chart evaluation. The benchmark, which includes human explanations, offers a resource for diagnosing AI behavior and guiding future improvements in how multimodal models assist visualization design and evaluation.

Abstract

The field of Multimodal Large Language Models (MLLMs) has made remarkable progress in visual understanding tasks, presenting a vast opportunity to predict the perceptual and emotional impact of charts. However, it also raises concerns, as many applications of LLMs are based on overgeneralized assumptions from a few examples and lack sufficient validation of their performance and effectiveness. We introduce Chart-to-Experience, a benchmark dataset comprising 36 charts, evaluated by crowdsourced workers for their impact on seven experiential factors. Using the dataset as ground truth, we evaluated the capabilities of state-of-the-art MLLMs on two tasks: direct prediction and pairwise comparison of charts. Our findings imply that MLLMs are not as sensitive as human evaluators when assessing individual charts, but are accurate and reliable in pairwise comparisons.

Paper Structure

This paper contains 12 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The collection of 12 charts on the topic of House Prices
  • Figure 2: The task page for crowdsourced online study
  • Figure 3: Accuracy of MLLMs in pairwise chart comparison across the seven experiential factors, binned by the magnitude of the difference in human ratings between the two charts in each pair. The overall upward trend suggests that MLLMs compare chart pairs more accurately when the score disparity is larger.
  • Figure 4: The two charts that most reduced accuracy for interest and aesthetic pleasure.
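
The binning described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name, the toy ratings, the model choices, and the bin edges are all hypothetical, and the paper's actual bin widths are not stated in this summary. The sketch assumes each chart has one mean human rating per factor and that the model returns the chart it judges higher for each pair.

```python
from itertools import combinations


def binned_pairwise_accuracy(human_scores, model_choice, bin_edges):
    """Accuracy of pairwise choices, grouped by human score difference.

    human_scores: chart id -> mean human rating for one factor.
    model_choice: (chart_a, chart_b) -> chart id the model judged higher,
                  keyed by pairs in sorted id order.
    bin_edges:    ascending upper edges of the score-difference bins
                  (hypothetical values, chosen by the caller).
    """
    hits = [0] * len(bin_edges)
    totals = [0] * len(bin_edges)
    for a, b in combinations(sorted(human_scores), 2):
        diff = abs(human_scores[a] - human_scores[b])
        if diff == 0:
            continue  # humans expressed no preference, so skip the pair
        winner = a if human_scores[a] > human_scores[b] else b
        # Place the pair in the first bin whose upper edge covers diff;
        # pairs beyond the last edge are left out of every bin.
        for i, edge in enumerate(bin_edges):
            if diff <= edge:
                totals[i] += 1
                hits[i] += int(model_choice[(a, b)] == winner)
                break
    # None marks a bin that received no pairs.
    return [h / t if t else None for h, t in zip(hits, totals)]


# Toy data, invented for illustration only.
toy_human = {"c1": 3.0, "c2": 3.5, "c3": 5.0}
toy_model = {("c1", "c2"): "c1", ("c1", "c3"): "c3", ("c2", "c3"): "c3"}
toy_result = binned_pairwise_accuracy(toy_human, toy_model, [1.0, 3.0])
```

With this toy input, the one close pair is judged incorrectly and the two pairs with larger gaps are judged correctly, which mirrors the shape of the trend the caption describes without reproducing any reported number.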