Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin

TL;DR

This work targets chart parsing, a vision-centric task where LVLMs struggle with dense visual information. It proposes Visual Self-Refine (VSR), a visual-feedback loop that first localizes data points at the pixel level, visualizes these anchors, and uses the feedback to refine perception before decoding into structured JSON. The ChartVSR model implements VSR with a two-stage pipeline (Refine and Decode) and is trained with a diverse data engine. The work also introduces ChartP-Bench, a challenging new benchmark. Empirical results show ChartVSR achieving competitive performance on existing benchmarks and substantial gains on ChartP-Bench, which supports VSR as a general mechanism for improving perception accuracy in vision-centric tasks.

Abstract

While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
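
The two-stage flow described in the abstract can be pictured as a simple control loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function `visual_self_refine`, its four callables (`localize`, `render`, `refine`, `decode`), the `max_rounds` cap, and the "stop when the points no longer change" rule are all hypothetical names and choices introduced here for illustration.

```python
from typing import Any, Callable, List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates of one data point


def visual_self_refine(
    chart: Any,
    localize: Callable[[Any], List[Point]],
    render: Callable[[Any, List[Point]], Any],
    refine: Callable[[Any, List[Point]], List[Point]],
    decode: Callable[[Any, List[Point]], dict],
    max_rounds: int = 3,
) -> dict:
    """Hypothetical VSR control loop: localize, visualize, refine, then decode."""
    # Initial pixel-level localization of every data point.
    points = localize(chart)

    # Refine Stage: draw the current points on the chart and let the
    # model inspect the overlay and propose corrected points.
    for _ in range(max_rounds):
        overlay = render(chart, points)
        new_points = refine(overlay, points)
        if new_points == points:
            # Assumed stopping rule: no change means the model accepts them.
            break
        points = new_points

    # Decode Stage: use the verified points as anchors to produce
    # the structured output (for example, a JSON-like dict).
    return decode(chart, points)
```

In this sketch each callable would wrap a model call or a drawing routine. Passing them in as parameters keeps the loop itself independent of any particular model or plotting library.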

Paper Structure (42 sections, 4 equations, 16 figures, 6 tables)

Figures (16)

  • Figure 1: Overview of Chart Parsing and Visual Self-Refine (VSR). (a) Even strong models often fail to produce entirely correct results in a single pass of Chart Parsing. (b) A schematic illustration of the proposed VSR method. The process follows the sequence indicated by the gray arrows.
  • Figure 2: Insight of Visual Self-Refine. (a) Without visual feedback, a strong LVLM (GPT-4o) fails to identify its own parsing errors. (b) In contrast, when the model's output is visualized on the chart (yellow markers), the same model can readily spot discrepancies and identify its mistakes. This highlights the ineffectiveness of direct self-correction for vision-centric tasks and motivates our proposed Visual Self-Refine approach.
  • Figure 3: An illustrative case of ChartVSR.
  • Figure 4: Overview of our Data Engine. (a) The ChartVSR data generation pipeline. (b) Example data generated by our data engine, showcasing high diversity and high quality.
  • Figure 5: A multi-round refinement case of ChartVSR. The initial prediction (left) contains multiple errors, which are fully corrected after two rounds of refinement.
  • ...and 11 more figures