Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen

TL;DR

The paper investigates why open-source LLMs struggle with data analysis by decomposing the task into three capabilities: data understanding, code generation, and strategic planning. It finds that strategic planning quality is the key determinant of success, that interaction design and task complexity shape reasoning, and that data quality matters more than data diversity. The authors propose a strategy-guided data synthesis pipeline and report significant improvements, enabling open-source LLMs to approach the performance of leading closed-source systems on data analysis tasks. The work offers actionable guidance on data collection, interaction design, and training for improving open-source data analysis capabilities.

Abstract

Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities. Code is available at https://github.com/zjunlp/DataMind.

Paper Structure

This paper contains 37 sections, 1 equation, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: Core capabilities involved in data analysis tasks. We break down the process into three key components: data understanding, coding and planning.
  • Figure 2: The distribution of error types.
  • Figure 3: Impact of dialogue turn strategies across different Qwen model scales and training methods.
  • Figure 4: Impact of reasoning length on model performance across token budgets.
  • Figure 5: Impact of training data difficulty on interaction patterns. (a) Average number of response rounds of the model. (b) Average output token length of the model.
  • ...and 3 more figures