ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning

Ahmed Masry, Mehrad Shahmohammadi, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty

TL;DR

ChartInstruct tackles the need for general-purpose chart understanding by creating a large instruction-following corpus tailored to charts and by developing two chart-specific vision-language systems: an end-to-end model that connects a chart-pretrained vision encoder to an LLM, and a pipeline that first extracts the data table from a chart and then passes it to the LLM. The approach yields state-of-the-art results on multiple chart-domain benchmarks and performs strongly on novel tasks, with human evaluation supporting improved task adherence and usefulness. The work also provides a diverse WebCharts-derived dataset and accompanying code to foster further research in chart comprehension and reasoning. Overall, ChartInstruct broadens the applicability of vision-language models to real-world chart analysis, addressing both accuracy and generalization in chart reasoning.

Abstract

Charts provide visual representations of data and are widely used for analyzing information, addressing queries, and conveying insights to others. Various chart-related downstream tasks have emerged recently, such as question answering and summarization. A common strategy for solving these tasks is to fine-tune models originally trained on vision-language tasks. However, such task-specific models cannot solve a wide range of chart-related tasks, which constrains their real-world applicability. To overcome these challenges, we introduce ChartInstruct: a novel chart-specific vision-language instruction-following dataset comprising 191K instructions generated with 71K charts. We then present two distinct systems for instruction tuning on such datasets: (1) an end-to-end model that connects a vision encoder for chart understanding with an LLM; and (2) a pipeline model that employs a two-step approach to extract chart data tables and input them into the LLM. In experiments on four downstream tasks, we first show the effectiveness of our model, which achieves a new set of state-of-the-art results. Further evaluation shows that our instruction-tuning approach supports a wide array of real-world chart comprehension and reasoning scenarios, thereby expanding the scope and applicability of our models to new kinds of tasks.
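The two-step pipeline described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names `flatten_table` and `run_pipeline`, the `a | b` row serialization, and the prompt wording are all assumptions made for illustration, and the table extractor and the LLM are passed in as plain callables so that no particular model or library is implied.

```python
from typing import Callable, List

# A chart's underlying data, represented as rows of cell strings (assumed format).
Table = List[List[str]]


def flatten_table(table: Table) -> str:
    """Serialize a table as text, one row per line, cells joined by ' | '."""
    return "\n".join(" | ".join(cell.strip() for cell in row) for row in table)


def run_pipeline(
    chart_image: bytes,
    instruction: str,
    extract_table: Callable[[bytes], Table],  # step 1: chart image -> data table
    llm: Callable[[str], str],                # step 2: text prompt -> answer
) -> str:
    """Hypothetical two-step pipeline: extract the table, then query the LLM."""
    table = extract_table(chart_image)
    prompt = f"{instruction}\n\nChart data:\n{flatten_table(table)}"
    return llm(prompt)


if __name__ == "__main__":
    # Stand-in components so the sketch runs without any model.
    fake_extractor = lambda image: [["Year", "Sales"], ["2020", "5"], ["2021", "8"]]
    echo_llm = lambda prompt: prompt  # returns the prompt it was given
    print(run_pipeline(b"", "Which year had higher sales?", fake_extractor, echo_llm))
```

The end-to-end system differs in that the vision encoder's output is fed to the LLM directly, with no intermediate text table.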

Paper Structure (29 sections, 15 figures, 14 tables)

Figures (15)

  • Figure 1: Examples of different chart-related tasks from our generated instruction dataset. Examples 1-5 are generated based on predefined tasks similar to previously developed downstream tasks like chart summarization, chart question answering, and fact-checking, while examples 6-9 introduce new types of tasks distilled by LLMs.
  • Figure 2: Instruction tuning process for chart collection. For the WebChart Corpus, the chart data is extracted automatically using Gemini Vision Pro. To distill new tasks we use GPT-4; for other task generation we use either GPT-3.5 or GPT-4.
  • Figure 3: The 20 most common root verbs (inner circle) and, for each, the four corresponding verb-object pairs across all generated instructions in our dataset.
  • Figure 4: The architecture of our end-to-end system: the LLM is frozen in (i) the pre-training step, while its parameters are updated in (ii) the instruction-tuning step. We use either Flan-T5-XL or Llama2 as the LLM in this architecture. Our pipeline system architecture is shown in a separate figure in the model appendix.
  • Figure 5: An example prompt to LLM and the corresponding output for a fact-checking task. The input consists of the task description, chart data, any output constraints, and output format.
  • ...and 10 more figures
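
The caption of Figure 5 names four parts of the prompt sent to the LLM: the task description, the chart data, any output constraints, and the output format. A rough sketch of how such a prompt might be assembled follows. The function name `build_prompt`, the section labels, and the sample strings are hypothetical; the actual prompt text used by the authors is not reproduced here.

```python
def build_prompt(
    task_description: str,
    chart_data: str,
    output_constraints: str = "",
    output_format: str = "",
) -> str:
    """Join the four prompt parts in order, skipping any that are empty.

    The labels below are illustrative assumptions, not the paper's wording.
    """
    parts = [
        ("Task", task_description),
        ("Chart data", chart_data),
        ("Constraints", output_constraints),  # optional ("any output constraints")
        ("Output format", output_format),
    ]
    return "\n\n".join(f"{label}:\n{text}" for label, text in parts if text)


if __name__ == "__main__":
    print(
        build_prompt(
            task_description="Write a claim about the chart and say whether the data supports it.",
            chart_data="Year | Sales\n2020 | 5\n2021 | 8",
            output_constraints="Use only values present in the chart data.",
            output_format="JSON with keys 'claim' and 'verdict'.",
        )
    )
```

Keeping the constraints optional reflects the caption's wording that constraints are included only when a task has them.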