Automatic Generation of Python Programs Using Context-Free Grammars

Kamel Yamani, Marwa Naïr, Riyadh Baghdadi

TL;DR

This work addresses the need for scalable, privacy-preserving, executable Python code data by introducing TinyPy Generator, a context-free grammar (CFG) based tool that generates correct-by-construction Python programs from a configurable subset of the language. It recursively expands production rules written in BNF, offers six levels of complexity, and provides an end-to-end pipeline that handles exceptions, removes duplicate programs, executes each program to capture its output, and writes the results to files. Key contributions include a detailed grammar for a subset of Python, an efficient generation process implemented in Python 3.8.6, empirical evidence of high diversity and scalability in the generated corpus, and an open-source implementation that can be customized and extended to other languages. The approach enables large-scale, customizable datasets for training language models and for validating interpreters and compilers, with practical relevance to both machine learning and programming-language research.
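The pipeline stages named above (exception handling, deduplication, execution with output capture) can be illustrated with a minimal sketch. This is not the TinyPy Generator implementation: the function names `run_and_capture` and `build_corpus`, the `# output` separator, and the choice to skip programs that raise are all assumptions made for illustration.

```python
import contextlib
import io


def run_and_capture(source):
    """Execute a program string and return what it printed, or None on error."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            # Fresh globals per program so runs cannot affect each other.
            exec(source, {})
    except Exception:
        # Assumption: programs that raise (e.g. ZeroDivisionError) are skipped.
        return None
    return buffer.getvalue()


def build_corpus(candidates):
    """Deduplicate candidate programs, run each one, and pair it with its output."""
    seen = set()
    corpus = []
    for source in candidates:
        if source in seen:
            continue  # exact duplicate of an earlier program
        seen.add(source)
        output = run_and_capture(source)
        if output is None:
            continue
        # Hypothetical record layout: program text, a marker line, then the output.
        corpus.append(f"{source}\n# output\n{output}")
    return corpus


candidates = ["x = 1\nprint(x)", "x = 1\nprint(x)", "print(1 / 0)"]
corpus = build_corpus(candidates)
print(len(corpus))  # 1: one duplicate dropped, one failing program skipped
```

Writing `corpus` to a file is then a plain `open(...).write(...)` loop. Note that `exec` is only appropriate here because the programs come from a trusted generator rather than from external input.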

Abstract

In recent years, data has emerged as the new gold, serving as a powerful tool for creating intelligent systems. However, procuring high-quality data remains challenging, especially for code. To address this, we developed TinyPy Generator, a tool that generates random Python programs using a context-free grammar. The generated programs are guaranteed to be correct by construction. Our system uses custom production rules, written in Backus-Naur Form (BNF), to recursively generate code. This allows us to generate code with different levels of complexity, ranging from code containing only assignments to more complex code containing conditionals and loops. Our proposed tool enables effortless large-scale Python code generation, beneficial for a wide range of applications. TinyPy Generator is particularly useful in the field of machine learning, where it can generate substantial amounts of Python code for training Python language models. Additionally, researchers studying programming languages can use this tool to create datasets for their experiments, which can help validate the robustness of code interpreters or compilers. Unlike existing research, we have open-sourced our implementation. This allows customization according to user needs and extends potential usage to other languages.
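To make the idea of recursively expanding production rules concrete, here is a minimal sketch of grammar-driven generation. The grammar below is a toy written for this example, not the TinyPy grammar, and the names `GRAMMAR`, `expand`, `generate_program`, and the `max_depth` cap are all hypothetical. The toy grammar avoids division and only prints variables it has already assigned, so every program it produces runs without error.

```python
import random

# Toy grammar (an assumption for illustration): each nonterminal maps to a list
# of alternatives, and each alternative is a sequence of symbols. Any symbol
# that is not a key of GRAMMAR is treated as literal program text.
GRAMMAR = {
    "PROGRAM": [["a = ", "EXPR", "\n", "b = ", "EXPR", "\n", "print(a ", "OP", " b)"]],
    "EXPR": [["DIGIT"], ["EXPR", " ", "OP", " ", "DIGIT"]],
    "OP": [["+"], ["-"], ["*"]],
    "DIGIT": [[str(d)] for d in range(10)],
}


def expand(symbol, rng, depth=0, max_depth=6):
    """Recursively expand a symbol into program text."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth cap, drop directly self-recursive alternatives so the
        # expansion is guaranteed to terminate (handles direct recursion only).
        alternatives = [alt for alt in alternatives if symbol not in alt]
    production = rng.choice(alternatives)
    return "".join(expand(part, rng, depth + 1, max_depth) for part in production)


def generate_program(seed=None):
    """Generate one random program from the start symbol."""
    return expand("PROGRAM", random.Random(seed))


print(generate_program(seed=0))
```

Different complexity levels could be modeled in this style by swapping in grammars with more rules (conditionals, loops); how TinyPy Generator actually organizes its levels is described in the paper and its source code, not here.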

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: The start symbol "ALL" used in the TinyPy Generator's code generation process.
  • Figure 2: TinyPy Generator's Code Generation Process.
  • Figure 3: Code Snippet Examples.
  • Figure 4: Use case in Machine Learning Research.
  • Figure 5: Evaluation on the Code Execution task.