ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé

TL;DR

This work introduces ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset, and proposes a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification.

Abstract

While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming.
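
The three evaluation signals named in the abstract (execution-based checks, pixel-level similarity, and logical code verification) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the mean-absolute-difference similarity, and the AST-based call check are all assumptions chosen for illustration, and the sketch uses only the Python standard library, with images represented as nested lists of grayscale values.

```python
import ast


def executes_cleanly(code: str) -> bool:
    """Execution-based check (hypothetical helper): True if the code runs without raising."""
    try:
        exec(compile(code, "<candidate>", "exec"), {})
        return True
    except Exception:
        return False


def pixel_similarity(img_a, img_b) -> float:
    """Pixel-level similarity (assumed metric): 1 minus the normalized mean absolute
    difference between two same-sized grayscale images with values in [0, 255]."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("images must be non-empty and the same size")
    mad = sum(abs(a - b) for a, b in zip(flat_a, flat_b)) / len(flat_a)
    return 1.0 - mad / 255.0


def calls_function(code: str, name: str) -> bool:
    """Logical code check (assumed form): True if the code calls a function or
    method with the given name, found by walking the parsed syntax tree."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                called = func.attr
            else:
                called = getattr(func, "id", None)
            if called == name:
                return True
    return False
```

In a multi-turn setting, checks like these would be applied to the code produced at each turn, for example confirming that an edit requesting a title actually introduces a `set_title` call and that the script still runs. A real harness would render the chart and compare the rendered images rather than nested lists.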

Paper Structure

This paper contains 64 sections, 1 equation, 13 figures, 1 table.

Figures (13)

  • Figure 1: Overall score degradation across conversation turns. All models exhibit declining performance, with smaller models (InternVL3-1B, Qwen3-VL-2B) showing steeper drops. The gap between high-performing and low-performing models widens progressively.
  • Figure 2: Model performance curves across difficulty levels (weighted by rendering success). Proprietary models and larger open-source models maintain performance better on hard modifications, while smaller models show steeper degradation.
  • Figure 3: Model performance by modification type. Style-focused modifications (axis_style, series_label) yield higher scores, while data-centric operations (rolling_average, data_transformation) prove more challenging across all models.
  • Figure 4: Instruction following by evaluation type across turns. Left: Programmatic (assertion-based, 0–1 scale). Right: LLM-judged (semantic, 1–5 scale). Top models maintain an advantage in both categories, with LLM-judged scores showing more stability across turns.
  • Figure 5: Distribution of samples across difficulty levels in ChartEditBench. The dataset emphasizes medium and hard modifications to maximize evaluation discriminability.
  • ...and 8 more figures