Distill Visual Chart Reasoning Ability from LLMs to MLLMs

Wei He, Zhiheng Xi, Wanxu Zhao, Xiaoran Fan, Yiwen Ding, Zifei Shan, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

This work tackles the challenge of scalable visual chart reasoning in multimodal models by introducing Code-as-Intermediary Translation (CIT), which uses executable chart-plotting code as a bridge between the visual and textual modalities. Through CIT, the ReachQA dataset (3,249 charts, 19,963 Q&A pairs) is synthesized at low cost, enabling distillation of visual reasoning from LLMs into MLLMs. Experiments show that ReachQA-trained models improve on both chart-centric and general multimodal reasoning benchmarks, generalize to MathVista and MATH-Vision, and improve further when ReachQA is combined with broader data. The findings point to a scalable path for producing high-quality multimodal instruction data and offer guidance for dataset construction, model training, and the evaluation of visual reasoning capabilities.

Abstract

Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs), including recognizing key information from visual inputs and conducting reasoning over it. While fine-tuning MLLMs for reasoning is critical, collecting and annotating charts and questions is expensive, hard to scale, and often results in low-quality annotations. To address this, we propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. The code serves as an intermediary that translates visual chart representations into textual representations, enabling language models to understand cross-modal information and generate reasoning chains accordingly. In this way, we can employ text-based synthesizing techniques to expand chart-plotting code and generate high-quality Q&A pairs for training models. This produces ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and reasoning abilities of MLLMs. Experiments show that models fine-tuned with ReachQA not only perform well on chart-related tasks but also show performance gains on general reasoning benchmarks. The code and dataset are publicly available at https://github.com/hewei2001/ReachQA.
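
The idea that plotting code can stand in for the chart on the text side can be illustrated with a minimal sketch. This is not the authors' pipeline: the plotting script, the variable names (`years`, `revenue`), the file name `chart_0001.png`, and the helper functions are all hypothetical, and a hand-written rule stands in for the LLM that would normally read the code and write the question, answer, and reasoning chain. The sketch only shows that the same code text that would render the image also exposes the chart's data as text.

```python
import ast

# Hypothetical chart-plotting code, kept as plain text. In a CIT-style pipeline
# this text would be executed to render the chart image; here it is only parsed.
PLOT_CODE = '''
import matplotlib.pyplot as plt

years = [2019, 2020, 2021, 2022]
revenue = [12.5, 14.0, 18.2, 17.1]

plt.plot(years, revenue, marker="o")
plt.title("Annual Revenue (million USD)")
plt.savefig("chart_0001.png")
'''


def extract_literals(code: str) -> dict:
    """Collect top-level `name = <literal>` assignments from the plotting code."""
    values = {}
    for node in ast.parse(code).body:
        if isinstance(node, ast.Assign) and len(node.targets) == 1:
            target = node.targets[0]
            if isinstance(target, ast.Name):
                try:
                    values[target.id] = ast.literal_eval(node.value)
                except ValueError:
                    pass  # right-hand side is not a plain literal; skip it
    return values


def build_record(code: str, image_path: str) -> dict:
    """Pair the (would-be) rendered image with a Q&A derived from the code text.

    Assumption: a simple rule replaces the LLM that would generate the Q&A.
    """
    data = extract_literals(code)
    years, revenue = data["years"], data["revenue"]
    peak_year = years[revenue.index(max(revenue))]
    return {
        "image": image_path,  # what the MLLM would see during training
        "question": "In which year did revenue peak?",
        "answer": str(peak_year),
    }


record = build_record(PLOT_CODE, "chart_0001.png")
print(record["question"], "->", record["answer"])  # answer is 2021
```

In this sketch the text-only side never looks at pixels: everything needed to pose and answer the question comes from the code, while the training record points at the image that the same code would produce.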

Paper Structure

This paper contains 45 sections, 18 figures, 10 tables.

Figures (18)

  • Figure 1: Error distribution of three baseline models vs. our ReachQA-trained versions on the ChartQA test set, as judged by GPT-4o. Error types are categorized into Recognition Error, Reasoning Error, and Other Errors (question misinterpretation, factual inconsistency or hallucination, and response refusal).
  • Figure 2: Overview of the Code-as-Intermediary Translation (CIT) method for synthesizing multimodal instruction data. The process starts with 33 seed codes, generating plot codes across various chart types, topics, and complexity levels via Self-Instruct and Evol-Instruct. The chart and instruction sets are constructed bi-directionally, and the final filtering yields ReachQA, a dataset for distilling visual chart reasoning abilities from LLMs to MLLMs.
  • Figure 3: Performance comparison of different training data ratios with 8k total data. The dashed line represents the performance of the model trained on the full 20k data.
  • Figure 4: Performance comparison of models on 7 general tasks and 5 specialized reasoning tasks.
  • Figure 5: An example of attention visualization from the ChartQA dataset. The top row shows the results from the vanilla LLaVA-Next-Llama3-8B model, while the bottom row displays the results from our fine-tuned model. For each output, we present the attention distribution (highlighted zones) at three key steps, calculated by averaging the attention values of all tokens in each step.
  • ...and 13 more figures