Charts Are Not Images: On the Challenges of Scientific Chart Editing

Shawn Li, Ryan Rossi, Sungchul Kim, Sunav Choudhary, Franck Dernoncourt, Puneet Mathur, Zhengzhong Tu, Yue Zhao

TL;DR

This work argues that scientific chart editing is a structured transformation problem governed by a graphical grammar, not a pixel-level task. It introduces FigEdit, a large-scale benchmark with 30,836 edited figures across 10 chart types and five task settings, incorporating base-figure generation, multiple annotation types, and a semantics-aware evaluation pipeline. The authors demonstrate that state-of-the-art editing models achieve high pixel similarity yet fail to perform correct, data-consistent edits, highlighting a gap between perceptual quality and semantic fidelity. By formalizing the problem and providing a task-structured benchmark, FigEdit aims to drive development of structure-aware figure editors and fair, semantic evaluation in scientific visualization.

Abstract

Generative models, such as diffusion and autoregressive approaches, have demonstrated impressive capabilities in editing natural images. However, applying these tools to scientific charts rests on a flawed assumption: a chart is not merely an arrangement of pixels but a visual representation of structured data governed by a graphical grammar. Consequently, chart editing is not a pixel-manipulation task but a structured transformation problem. To address this fundamental mismatch, we introduce FigEdit, a large-scale benchmark for scientific figure editing comprising over 30,000 samples. Grounded in real-world data, our benchmark is distinguished by its diversity, covering 10 distinct chart types and a rich vocabulary of complex editing instructions. The benchmark is organized into five distinct and progressively challenging tasks: single edits, multi edits, conversational edits, visual-guidance-based edits, and style transfer. Our evaluation of a range of state-of-the-art models on this benchmark reveals their poor performance on scientific figures, as they consistently fail to handle the underlying structured transformations required for valid edits. Furthermore, our analysis indicates that traditional evaluation metrics (e.g., SSIM, PSNR) have limitations in capturing the semantic correctness of chart edits. Our benchmark demonstrates the profound limitations of pixel-level manipulation and provides a robust foundation for developing and evaluating future structure-aware models. By releasing FigEdit (https://github.com/adobe-research/figure-editing), we aim to enable systematic progress in structure-aware figure editing, provide a common ground for fair comparison, and encourage future research on models that understand both the visual and semantic layers of scientific charts.
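The claim that pixel metrics can miss semantic correctness can be illustrated with a toy example. The sketch below is not the FigEdit evaluation pipeline; it is a minimal illustration that assumes a hand-rolled rasterizer (`render_bars`, a hypothetical helper) and a plain NumPy PSNR. The instruction is "remove the third bar". One output ignores the instruction entirely; the other removes the bar correctly but is rendered two pixels to the right. PSNR against the target ranks the failed edit above the correct one.

```python
import numpy as np

def render_bars(heights, size=200, bar_width=10, gap=30, offset=0):
    """Rasterize a toy bar chart: white canvas (1.0) with black bars (0.0).

    Hypothetical helper for illustration only; a height of 0 draws no bar.
    """
    img = np.ones((size, size), dtype=np.float64)
    for i, h in enumerate(heights):
        left = gap + i * gap + offset
        img[size - h:size, left:left + bar_width] = 0.0
    return img

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

# Instruction: "remove the third bar".
source = render_bars([60, 90, 20, 120])
target = render_bars([60, 90, 0, 120])

# Output A: the model returns the input unchanged (semantically wrong).
failed_output = source.copy()

# Output B: the bar is removed (semantically correct), but the whole
# plot is shifted two pixels to the right.
correct_but_shifted = render_bars([60, 90, 0, 120], offset=2)

psnr_failed = psnr(failed_output, target)
psnr_correct = psnr(correct_but_shifted, target)

print(f"PSNR, failed edit:          {psnr_failed:.2f} dB")
print(f"PSNR, correct shifted edit: {psnr_correct:.2f} dB")
print("Pixel metric prefers the failed edit:", psnr_failed > psnr_correct)
```

The failed edit differs from the target only in the small region of the unremoved bar, while the correct edit pays a pixel penalty along every bar edge. A check that compares the underlying data (here, the list of bar heights) would rank the two outputs the other way around.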

Paper Structure

This paper contains 45 sections, 12 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: FigEdit benchmark. Top-left: an example figure illustrating the basic task. Bottom-left: a radar chart comparing model performance on the single-edit task, highlighting the benchmark’s ability to reveal differences in editing capabilities. Right: taxonomy of the benchmark covering five tasks (single edit, multi edit, conversational edit, visual guidance, and style transfer).
  • Figure 2: Comparison of chart editing evaluation signals on three representative cases. The left block shows the Input Figure and the Instruction. The right block shows the Output Figure from OmniGen2, the Classic Metrics (e.g., SSIM and PSNR), and the LLM Scores. We observe that classic pixel metrics can remain high while the edit is wrong. This reveals a gap between pixel similarity and semantic edit correctness, which motivates semantics-aware evaluation for figure editing.
  • Figure 3: Qualitative examples of figure editing with three representative instructions. For each case, the input figure and target instruction are shown on the left, and outputs from Imagen 4, GPT-Image, and OmniGen2 are shown on the right.
  • Figure 4: Radar charts for different tasks (normalized with epsilon, LPIPS inverted). Each chart compares all models on SSIM, PSNR, OCR, LPIPS, and three LLM scores.
  • Figure 5: Additional qualitative examples of figure editing results. Each row shows an input figure (left), the corresponding natural language instruction (middle), and the output figures generated by Imagen 4, GPT-Image, and OmniGen2 (right). The cases cover representative edit types, including data point removal, data point addition, axis text scaling, layout adjustments, and targeted point deletion. While the models sometimes produce visually consistent outputs, they often fail to accurately execute the requested transformation, highlighting the limitations of current instruction-based figure editing systems.
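
The Figure 4 caption mentions that radar-chart values are "normalized with epsilon, LPIPS inverted" but does not give the formula. One plausible reading is sketched below, under stated assumptions: per-metric min-max normalization across models with a small epsilon in the denominator to avoid division by zero, and flipping LPIPS so that larger always means better. The function name `normalize_for_radar`, the epsilon value, and the scores in `toy_scores` are illustrative placeholders, not values or code from the paper.

```python
import numpy as np

EPS = 1e-8  # assumed value; the caption only says "epsilon"

def normalize_for_radar(scores, lower_is_better=("LPIPS",)):
    """Min-max normalize each metric across models to roughly [0, 1].

    scores: dict mapping metric name -> list of per-model values.
    Metrics listed in lower_is_better are inverted after normalization
    so that a larger radar value always means a better model.
    """
    normalized = {}
    for metric, values in scores.items():
        v = np.asarray(values, dtype=np.float64)
        norm = (v - v.min()) / (v.max() - v.min() + EPS)
        if metric in lower_is_better:
            norm = 1.0 - norm
        normalized[metric] = norm
    return normalized

# Placeholder scores for three hypothetical models (not results from the paper).
toy_scores = {
    "SSIM": [0.70, 0.80, 0.90],
    "LPIPS": [0.30, 0.20, 0.10],  # lower is better before inversion
    "OCR": [0.50, 0.50, 0.50],    # identical scores: epsilon avoids 0/0
}

radar = normalize_for_radar(toy_scores)
for metric, values in radar.items():
    print(metric, np.round(values, 3))
```

With this convention, the model with the lowest LPIPS gets the largest radar value on that axis, and a metric on which all models tie collapses to zero rather than producing NaN.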