START: Spatial and Textual Learning for Chart Understanding

Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu, Robinson Piramuthu

TL;DR

This paper addresses chart understanding by introducing START, a spatial–textual learning framework for multimodal large language models that jointly reason about a chart's visual layout and its underlying data. It combines chart-element grounding (spatial) and chart-to-code generation (textual) within supervised finetuning and reinforcement learning, backed by the START-Dataset and the Chart Spatial understanding Benchmark (CS-Bench). The dataset pipeline translates real-world charts into executable code to recover data representations and element locations, while CS-Bench provides a rigorous evaluation of spatial reasoning in charts. Empirical results show consistent gains across diverse benchmarks and model sizes, demonstrating the value of dual learning for fine-grained chart reasoning and opening the door to more accurate chart intelligence in real-world applications.

Abstract

Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
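Chart-element grounding, as described in the abstract, asks a model to localize a chart component such as a legend or subplot. This section does not say how START scores a predicted location, so the following is only a minimal sketch under the assumption that locations are axis-aligned bounding boxes compared by intersection-over-union (IoU), a standard grounding measure. The `Box` type, the `iou` function, and the example coordinates are all hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box.x1 - box.x0) * max(0.0, box.y1 - box.y0)


def iou(pred: Box, target: Box) -> float:
    """Intersection-over-union between a predicted and a reference box."""
    # The overlap region is bounded by the innermost edges of the two boxes.
    inter = Box(
        max(pred.x0, target.x0),
        max(pred.y0, target.y0),
        min(pred.x1, target.x1),
        min(pred.y1, target.y1),
    )
    inter_area = area(inter)
    union = area(pred) + area(target) - inter_area
    return inter_area / union if union > 0 else 0.0


# Hypothetical example: a reference legend box and a slightly smaller prediction.
legend_target = Box(400, 20, 600, 100)
legend_pred = Box(420, 30, 600, 100)
score = iou(legend_pred, legend_target)
```

A score of this kind lies in [0, 1] and could in principle act as a spatial evaluation or reward signal, but whether START or CS-Bench uses IoU specifically is an assumption here, not something this section states.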

Paper Structure

This paper contains 19 sections, 5 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: START's reward design in reinforcement learning.
  • Figure 2: A: Analysis of existing chart datasets. B: Overview of the START-Dataset generation pipeline.
  • Figure 3: A: The dataset sample visualization and B: the START-SFT and START-RL dataset statistics.
  • Figure 4: A: Samples from refChartQA [vogel2025refchartqa], whose locations cover only limited types of chart components and which focus on single-subplot chart images. B: CS-Bench statistics. C: Samples from CS-Bench with the target region visualized under a red mask.
  • Figure 5: Visualization of predictions from START versus Qwen2.5-VL [bai2025qwen2]. Benefiting from spatial and textual learning, START produces better predictions in chart question answering (Subplot A), chart-element grounding (Subplot B), and chart-to-code generation (Subplot C), reflecting the MLLM's improved spatial and textual understanding of charts.
  • ...and 10 more figures