DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries
Manit Mishra, Abderrahman Braham, Charles Marsom, Bryan Chung, Gavin Griffin, Dakshesh Sidnerlikar, Chatanya Sarin, Arjun Rajaram
TL;DR
This work investigates zero-shot data analysis using a Language Data Scientist (LDS) built on GPT-3.5 and enhanced with an Action Plan Generator and SayCan/Chain-of-Thought prompting to extract insights from datasets. The LDS is evaluated on the 15 DataAgent2023 benchmark datasets with 15 queries each, answering 74 of 225 queries correctly (32.89%), with the best performance on large datasets (~36%). The study identifies two main failure modes, incorrect code generation and token-limit constraints, and discusses future enhancements that leverage GPT-4, Reflexion, episodic memory, and larger datasets to improve robustness and scalability. Overall, the results show promising potential for LLM-driven, low-level data analysis workflows and outline concrete steps to improve accuracy and applicability in real-world data science tasks.
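The headline figure can be checked with a few lines of arithmetic. The snippet below is only a sanity check of the numbers quoted above; the helper name `accuracy_percent` is introduced here for illustration and does not come from the paper.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage, rounded to 2 places."""
    return round(100 * correct / total, 2)


# 15 benchmark datasets with 15 queries each (figures quoted in the TL;DR).
total_queries = 15 * 15

# 74 correct answers were reported across all queries.
overall = accuracy_percent(74, total_queries)

print(f"{total_queries} queries, {overall}% answered correctly")
```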
Abstract
Conventional processes for analyzing datasets and extracting meaningful information are often time-consuming and laborious. Previous work has identified manual, repetitive coding and data collection as major obstacles that keep data scientists from more nuanced work and higher-level projects. To address this, we evaluated OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS) that can extract key findings, including correlations and basic information, from a given dataset. The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards, including code-generation-based data science tasks involving libraries such as NumPy, Pandas, Scikit-Learn, and TensorFlow. It was broadly successful in correctly answering a given data science query related to the benchmark dataset. To answer a given question effectively, the LDS used several novel prompt engineering techniques, including Chain-of-Thought reinforcement and SayCan prompt engineering. Our findings demonstrate great potential for leveraging Large Language Models for low-level, zero-shot data analysis.
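To make the zero-shot setup more concrete, the following is a minimal sketch of how a prompt for such a system might be assembled: a dataset schema, a few sample rows, the natural language query, and an instruction to plan step by step before writing code. Everything in it is an assumption made for illustration, including the function name `build_zero_shot_prompt`, the prompt wording, and the example columns; it is not the prompt used in this work, and it stops short of calling any model API.

```python
def build_zero_shot_prompt(columns, sample_rows, query):
    """Assemble a zero-shot prompt: schema, a few sample rows, then the question.

    Hypothetical helper; the wording below is illustrative only.
    """
    header = ", ".join(columns)
    rows = "\n".join(
        ", ".join(str(row[col]) for col in columns) for row in sample_rows
    )
    return (
        "You are a data scientist working with a pandas DataFrame named df.\n"
        f"Columns: {header}\n"
        f"Sample rows:\n{rows}\n\n"
        f"Question: {query}\n"
        # Chain-of-Thought style cue: ask for a plan before any code.
        "First write a numbered action plan, reasoning step by step.\n"
        "Then write Python code that carries out the plan and prints the answer."
    )


# Made-up example data, used only to show the shape of the prompt.
columns = ["age", "income"]
sample_rows = [
    {"age": 34, "income": 52000},
    {"age": 51, "income": 68000},
]
prompt = build_zero_shot_prompt(
    columns, sample_rows, "Is age correlated with income?"
)
print(prompt)
```

Passing only the schema and a handful of rows, rather than the full dataset, is one plausible way to keep a prompt within a model's token limit.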
