DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries

Manit Mishra; Abderrahman Braham; Charles Marsom; Bryan Chung; Gavin Griffin; Dakshesh Sidnerlikar; Chatanya Sarin; Arjun Rajaram

TL;DR

This work investigates zero-shot data analysis using a Language Data Scientist (LDS) built on GPT-3.5 and enhanced with an Action Plan Generator (AcPG) and SayCan/Chain-of-Thought prompting to extract insights from datasets. The LDS is evaluated on 15 DataAgent2023 benchmark datasets with 15 queries each, and it answers 74 of the 225 queries correctly (32.89%), with the best performance on large datasets (~36%). The study identifies two main failure modes, incorrect code generation and token-limit constraints, and discusses future enhancements leveraging GPT-4, refleXion, episodic memory, and larger datasets to improve robustness and scalability. Overall, the results demonstrate promising potential for LLM-driven, low-level data analysis workflows and outline concrete steps to improve accuracy and applicability in real-world data science tasks.
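As a quick sanity check on the headline figure, the snippet below recomputes the query count and accuracy from the numbers quoted above. The variable names are illustrative only and do not come from the paper.

```python
# Figures quoted in the TL;DR above; variable names are illustrative.
datasets = 15             # DataAgent2023 benchmark datasets
queries_per_dataset = 15  # queries asked of each dataset
correct = 74              # queries answered correctly

total = datasets * queries_per_dataset  # total number of queries
accuracy = correct / total              # fraction answered correctly

print(f"{correct}/{total} = {accuracy:.2%}")
```

The total comes to 225 queries, and 74/225 rounds to the reported 32.89%.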

Abstract

Conventional processes for analyzing datasets and extracting meaningful information are often time-consuming and laborious. Previous work has identified manual, repetitive coding and data collection as major obstacles that hinder data scientists from undertaking more nuanced labor and high-level projects. To combat this, we evaluated OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS) that can extract key findings, including correlations and basic information, from a given dataset. The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards, including code-generation-based data science tasks involving libraries such as NumPy, Pandas, Scikit-Learn, and TensorFlow, and was broadly successful in correctly answering a given data science query related to the benchmark dataset. The LDS used various novel prompt engineering techniques to effectively answer a given question, including Chain-of-Thought reinforcement and SayCan prompt engineering. Our findings demonstrate great potential for leveraging Large Language Models for low-level, zero-shot data analysis.
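The flow described above (gather background on the dataset, generate a natural-language plan, translate it to code, execute it) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `describe_dataset`, `answer_query`, and `stub_llm` are hypothetical names, the prompts are invented, and a stub stands in for GPT-3.5 so the example runs offline.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def describe_dataset(rows: List[Row]) -> str:
    """Summarise row count, column names, and missing values (hypothetical helper)."""
    columns = sorted({key for row in rows for key in row})
    missing = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    return f"rows={len(rows)}; columns={columns}; missing={missing}"


def answer_query(query: str, rows: List[Row], llm: Callable[[str], str]) -> object:
    """Plan in natural language, translate the plan to code, then execute it."""
    background = describe_dataset(rows)
    # Step 1: ask the model for a step-by-step plan (invented prompt wording).
    plan = llm(f"Dataset: {background}\nQuestion: {query}\nList the steps.")
    # Step 2: ask the model to turn the plan into code that sets `answer`.
    code = llm(f"Plan:\n{plan}\nWrite Python that sets `answer` using `rows`.")
    # Step 3: run the generated code. exec() on model output is unsafe
    # outside a sandbox; it is used here only to keep the sketch short.
    scope: Dict[str, object] = {"rows": rows}
    exec(code, scope)
    return scope["answer"]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned responses."""
    if prompt.startswith("Plan:"):
        return ("answer = sum(r['population'] for r in rows "
                "if r['population'] is not None)")
    return "1. Sum the population column, skipping missing values."


cities = [
    {"city": "A", "population": 10},
    {"city": "B", "population": None},  # missing value
    {"city": "C", "population": 5},
]
result = answer_query("What is the total population?", cities, stub_llm)
print(result)
```

Separating the plan from the code keeps each model call small, which matters given the token-limit failure mode noted in the TL;DR. A real system would also need to handle generated code that raises an exception or never sets `answer`.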

Paper Structure (19 sections, 6 figures)

Introduction
Methodology
Summary
Benchmark Datasets DataAgent2023
Queries
Gathering Background Information on a Dataset
Action Plan Generation
Specifics
SayCan
Low-Level Execution
Benchmark Dataset Answer Checker
Results
Discussion and Conclusion
Prompt Rewording
Multiple Answers in One Prompt
...and 4 more sections

Figures (6)

Figure 1: Some differences between common AutoML models and the Language Data Scientist.
Figure 2: A chunk of a sample benchmark dataset, labeled "Cities," containing both numerical and categorical data, along with missing values. The original dataset was medium-sized and had 165 rows.
Figure 3: An example of how natural-language steps generated by the AcPG are translated to code for the executor.
Figure 4: A broad overview of the model. A natural-language query, paired with an inputted dataset, is sent both to the AcPG and the LDS. The AcPG then generates a plan of action for answering the question with the given data, and an executor in the LDS computes the final output.
Figure 5: A summary of our results, which were generally stable across different Toy Dataset sizes and queries of varying difficulty.
...and 1 more figure
