Solving Data-centric Tasks using Large Language Models

Shraddha Barke, Christian Poelitz, Carina Suzana Negreanu, Benjamin Zorn, José Cambronero, Andrew D. Gordon, Vu Le, Elnaz Nouri, Nadia Polikarpova, Advait Sarkar, Brian Slininger, Neil Toronto, Jack Williams

TL;DR

The experiments show that LLM performance is indeed sensitive to the amount of data passed in the prompt, and that for tasks with a lot of syntactic variation in the input table, the cluster-then-select technique outperforms a random selection baseline.

Abstract

Large language models (LLMs) are rapidly replacing help forums like StackOverflow, and are especially helpful for non-professional programmers and end users. These users are often interested in data-centric tasks, such as spreadsheet manipulation and data wrangling, which are hard to solve if the intent is only communicated using a natural-language description, without including the data. But how do we decide how much data and which data to include in the prompt? This paper makes two contributions towards answering this question. First, we create a dataset of real-world NL-to-code tasks manipulating tabular data, mined from StackOverflow posts. Second, we introduce a cluster-then-select prompting technique, which adds the most representative rows from the input data to the LLM prompt. Our experiments show that LLM performance is indeed sensitive to the amount of data passed in the prompt, and that for tasks with a lot of syntactic variation in the input table, our cluster-then-select technique outperforms a random selection baseline.

Paper Structure (26 sections, 8 figures, 2 algorithms)

This paper contains 26 sections, 8 figures, 2 algorithms.

Figures (8)

  • Figure 1: An overview of our cluster-then-select prompting technique. The input is a data table and natural language query. The rows in the data table are first clustered based on their syntactic structure (in this case the name format). We depict different clusters using distinct colors. The most representative rows are then selected from each cluster to create a prompt to pass to the model. Finally, the generated completion is used to create an output column.
  • Figure 2: pass@$k$ with (a) no-data, (b) first-row, and (c) ten-rows passed to the model. The leftmost group of bars represents pass@$k$ over all classes, followed by separate pass@$k$ for ind, dep and ext tasks.
  • Figure 3: pass@$k$ for the 39% (17/44) of dep tasks that have more than two clusters, with no-data, random selection (random-n), and representative selection (represent-n), together with pass@1 with greedy sampling for full-data (1000 rows).
  • Figure 4: pass@$k$ for all dep tasks with no-data and with n=1, 5 and 10 rows passed to the model, using random (random-n) and representative (represent-n) selection. The completions are evaluated on 1000 rows.
  • Figure 5: Our tool transforms an input table and a query into a list of valid completions. The input data is used to extract the selected rows $R$. The resulting rows and query are used to construct a prompt which is fed to a code synthesis LLM, such as GPT-4 or CodeLlama, generating multiple possible completions. The outputs of these completions are then validated and the first $k$ valid completions (along with their outputs) are returned.
  • ...and 3 more figures
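
The cluster-then-select flow described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the character-class signature used for clustering (`shape_signature`), the round-robin choice of rows across clusters, and the function and parameter names are all assumptions made for illustration. The sketch only shows the general idea of grouping rows by syntactic structure and then picking rows so that each format is represented in the prompt.

```python
from collections import defaultdict
from itertools import groupby


def shape_signature(value: str) -> str:
    """Hypothetical helper: abstract a string into a coarse syntactic shape.

    Uppercase letters become 'A', lowercase 'a', digits '0'; other characters
    are kept as-is. Runs of the same class are collapsed to one symbol.
    """
    classes = []
    for ch in value:
        if ch.isupper():
            classes.append("A")
        elif ch.islower():
            classes.append("a")
        elif ch.isdigit():
            classes.append("0")
        else:
            classes.append(ch)
    # Collapse runs so that 'John' and 'Jonathan' share the shape 'Aa'.
    return "".join(key for key, _ in groupby(classes))


def cluster_then_select(rows, column, n):
    """Pick up to n rows so that each syntactic cluster is represented.

    Assumption: clusters are visited largest first, taking one row per
    cluster per pass (round-robin) until n rows have been collected.
    """
    clusters = defaultdict(list)
    for row in rows:
        clusters[shape_signature(str(row[column]))].append(row)

    # Largest clusters first; ties keep first-seen order (sorted is stable).
    ordered = sorted(clusters.values(), key=len, reverse=True)

    selected = []
    depth = 0
    while len(selected) < n and any(depth < len(c) for c in ordered):
        for cluster in ordered:
            if depth < len(cluster) and len(selected) < n:
                selected.append(cluster[depth])
        depth += 1
    return selected


if __name__ == "__main__":
    table = [
        {"name": "John Smith"},
        {"name": "Jane Doe"},
        {"name": "Doe, J."},
        {"name": "a.turing"},
        {"name": "Mary Ann Lee"},
    ]
    for row in cluster_then_select(table, "name", 3):
        print(row["name"])
```

With a budget of three rows, this sketch takes one row from each of three different name formats rather than two rows that share a format, which is the behaviour a random baseline does not guarantee.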