PromptSet: A Programmer's Prompting Dataset

Kaiser Pister, Dhruba Jyoti Paul, Patrick Brophy, Ishan Joshi

TL;DR

PromptSet addresses the lack of static tooling for prompts by introducing a large dataset of programmatic prompts from open-source Python code (>61k prompts) and proposing static linting concepts to improve reliability. The methodology combines AST-based extraction from code (via Tree-sitter) across OpenAI, Anthropic, Cohere, and LangChain usage, followed by post-processing, yield enhancement, and filtering. The authors analyze the dataset to reveal prompt composition, taxonomy alignment, technology propagation, and error patterns, demonstrating the need for prompt hygiene in production pipelines. They release the dataset and tooling publicly to enable CI/CD integration and community-driven improvements in prompt management.
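
The extraction step described above can be illustrated with a small sketch. It is not the authors' pipeline: it swaps Tree-sitter for Python's standard-library `ast` module, and the set of call names in `LLM_CALL_NAMES` as well as the helper names `call_name` and `extract_prompts` are hypothetical, chosen only for illustration.

```python
import ast

# Hypothetical call names treated as LLM entry points (an assumption for this
# sketch; the paper's actual Tree-sitter queries are not reproduced here).
LLM_CALL_NAMES = {"create", "complete", "generate", "PromptTemplate"}


def call_name(node: ast.Call) -> str:
    """Return the final name of the called function, e.g. 'create' for a.b.create()."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return ""


def extract_prompts(source: str) -> list[str]:
    """Collect string literals found inside the arguments of matching calls."""
    prompts = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and call_name(node) in LLM_CALL_NAMES:
            # Look at positional and keyword arguments alike.
            args = list(node.args) + [kw.value for kw in node.keywords]
            for arg in args:
                for sub in ast.walk(arg):
                    if isinstance(sub, ast.Constant) and isinstance(sub.value, str):
                        prompts.append(sub.value)
    return prompts
```

Calling `extract_prompts` on a file's source text returns every string literal passed to a matching call. That includes non-prompt strings such as model names, which is one reason a filtering pass like the one mentioned above would still be needed.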

Abstract

The rise of capabilities expressed by large language models has been quickly followed by the integration of the same complex systems into application-level logic. Algorithms, programs, systems, and companies are built around structured prompting to black-box models, where the majority of the design and implementation lies in capturing and quantifying the "agent mode". The standard way to shape a closed language model is to prime it for a specific task with a tailored prompt, often initially handwritten by a human. The textual prompts co-evolve with the codebase, taking shape over the course of project life as artifacts which must be reviewed and maintained, just as traditional code files might be. Unlike traditional code, we find that prompts do not receive effective static testing and linting to prevent runtime issues. In this work, we present a novel dataset called PromptSet, with more than 61,000 unique developer prompts used in open-source Python programs. We perform analysis on this dataset and introduce the notion of a static linter for prompts. Released with this publication are a Hugging Face dataset and a GitHub repository to recreate collection and processing efforts, both under the name pisterlabs/promptset.
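
To make the idea of a static linter for prompts concrete, here is a minimal sketch. The specific checks (malformed braces, placeholders with no supplied value, supplied variables that are never used) and the function name `lint_prompt` are assumptions made for illustration; the abstract does not say which rules the authors propose.

```python
from string import Formatter


def lint_prompt(template: str, provided: set[str]) -> list[str]:
    """Statically check a str.format-style prompt template (hypothetical rules)."""
    try:
        # Formatter.parse yields (literal_text, field_name, format_spec, conversion).
        fields = {name for _, name, _, _ in Formatter().parse(template) if name}
    except ValueError as exc:
        # Raised for unbalanced braces such as "Hello {name".
        return [f"malformed template: {exc}"]

    issues = []
    for name in sorted(fields - provided):
        issues.append(f"placeholder '{name}' has no value supplied")
    for name in sorted(provided - fields):
        issues.append(f"variable '{name}' is never used in the template")
    return issues
```

A check like this could run in CI over prompts extracted from a codebase, flagging problems before the template is formatted at runtime.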

Paper Structure

This paper contains 18 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Example code file
  • Figure 2: Distribution of prompt lengths in PromptSet.
  • Figure 3: Distribution of languages in PromptSet.
  • Figure 4: Zipf's law plotted on tokens from PromptSet.
  • Figure 5: Categorization of PromptSet.
  • ...and 1 more figure