Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen
TL;DR
The paper investigates why open-source LLMs struggle with data analysis by decomposing the task into three dimensions: data understanding, code generation, and strategic planning. It finds that strategic planning quality is the key determinant of success, that interaction design and task complexity shape reasoning, and that data quality matters more than data diversity. Building on these findings, the authors propose a strategy-guided data synthesis pipeline and show that it yields significant improvements, enabling open-source LLMs to approach the performance of leading closed-source systems on data analysis tasks. The work offers actionable guidance on data collection, interaction design, and training for advancing open-source data analysis capabilities, and sets the stage for broader evaluations across models and domains.
Abstract
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in such reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) strategic planning quality is the primary determinant of model performance; (2) interaction design and task complexity significantly influence reasoning capabilities; (3) data quality has a greater impact on performance than data diversity. We leverage these insights to develop a data synthesis methodology that significantly improves the analytical reasoning capabilities of open-source LLMs. Code is available at https://github.com/zjunlp/DataMind.
