Input-Gen: Guided Generation of Stateful Inputs for Testing, Tuning, and Training

Ivan R. Ivanov, Joachim Meyer, Aiden Grossman, William S. Moses, Johannes Doerfert

TL;DR

The paper tackles the scarcity of dynamic execution data for training and evaluating code-focused models by proposing a scalable, stateful input generation framework that works for any language compiled to LLVM. It introduces an LLVM-based instrumentation pass and two runtimes (generation and replay): the generation runtime produces the initial memory state and arguments for a given function, and the replay runtime re-executes the function on that input, enabling practical testing, tuning, and ML training workflows. Key results show high scalability and coverage: about $99\%$ of modules can be instrumented and about $90\%$ yield inputs that replay successfully, for a total of roughly $21.4$ million executable functions, with average block coverage rising from $37\%$ with a single input to $45\%$ with five guided inputs. The work enables large-scale, realistic input datasets and paves the way for data-driven compiler analysis and ML-assisted code reasoning by providing reusable, stateful test inputs across languages and architectures.

Abstract

The size and complexity of software applications is increasing at an accelerating pace. Source code repositories (along with their dependencies) require vast amounts of labor to keep them tested, maintained, and up to date. As the discipline now begins to also incorporate automatically generated programs, automation in testing and tuning is required to keep up with the pace - let alone reduce the present level of complexity. While machine learning has been used to understand and generate code in various contexts, machine learning models themselves are trained almost exclusively on static code without inputs, traces, or other execution time information. This lack of training data limits the ability of these models to understand real-world problems in software. In this work we show that inputs, like code, can be generated automatically at scale. Our generated inputs are stateful, and appear to faithfully reproduce the arbitrary data structures and system calls required to rerun a program function. By building our tool within the compiler, it both can be applied to arbitrary programming languages and architectures and can leverage static analysis and transformations for improved performance. Our approach is able to produce valid inputs, including initial memory states, for 90% of the ComPile dataset modules we explored, for a total of 21.4 million executable functions. Further, we find that a single generated input results in an average block coverage of 37%, whereas guided generation of five inputs improves it to 45%.

Paper Structure (43 sections, 12 figures)

This paper contains 43 sections, 12 figures.

Figures (12)

  • Figure 1: Sketch of the input generation framework for a file (left) containing three functions (top, middle, bottom). The user chooses to generate inputs for the top function, which results in an instrumented generation driver and a replay driver. Assuming the bottom function is called by the top one, it is instrumented as well and stays in both binaries. The middle function is not reached from the top function and consequently dropped. Each instrumented run generates an input file that can be executed by the replay driver. Optional profile feedback can guide the generation of new inputs.
  • Figure 2: (a) depicts a simple program which sums up values in a linked list. (b) shows the same program, instrumented for input generation. This captures all "side-effects" necessary to reproduce the execution at a later time. Specifically, we indicate that we need a value of pointer type for the argument, and that we will load from list->value and list->next. (c) presents the version instrumented to run the generated input. Prior to calling the sum() function, the runtime will set up the appropriate memory state for execution and will provide the function argument.
  • Figure 3: Function and global variable declarations (top) that are replaced by definitions as part of the instrumentation pass (bottom). For global variables one level of indirection is introduced to allow their memory to be part of the runtime memory pool. The indirection is eliminated at the beginning of each function that uses the global.
  • Figure 4: We instrument the module for generation using the runtime C API. The read, write, arg, and gen runtime calls are defined for all primitive types.
  • Figure 5: In order to simplify and optimize object addressing, we allocate a large amount of memory for objects handled by our runtime, and implement a logical memory space partitioning strategy where the top bits are the index of the object itself, and the bottom bits refer to the offset in the specific object. Note how when the runtime generates a pointer to a new object, it returns a pointer to the middle of the object, as the program may access objects at negative offsets as well.
  • ...and 7 more figures
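
The address layout described in the Figure 5 caption can be illustrated with a small sketch. It models addresses as plain integers and is not the paper's runtime: the 32-bit offset width and the names `OFFSET_BITS`, `new_object_pointer`, and `decode` are assumptions chosen for illustration. The top bits hold the object index, the bottom bits hold the offset within the object, and a fresh pointer targets the middle of the object so that negative offsets stay inside it.

```python
from typing import Tuple

# Hypothetical layout parameters; the real split is not given in the captions.
OFFSET_BITS = 32                    # assumed width of the per-object offset field
OBJECT_SIZE = 1 << OFFSET_BITS      # address range reserved for each object
MIDDLE = OBJECT_SIZE // 2           # fresh pointers target the middle of the object
OFFSET_MASK = OBJECT_SIZE - 1       # selects the bottom (offset) bits


def new_object_pointer(index: int) -> int:
    """Return the address handed to the program for object `index`."""
    return (index << OFFSET_BITS) | MIDDLE


def decode(address: int) -> Tuple[int, int]:
    """Split an address into (object index, signed offset from the middle)."""
    index = address >> OFFSET_BITS
    offset = (address & OFFSET_MASK) - MIDDLE
    return index, offset


p = new_object_pointer(3)
print(decode(p))       # (3, 0)
print(decode(p + 16))  # (3, 16)
print(decode(p - 8))   # (3, -8): a negative offset still maps to object 3
```

With this split, finding the object behind an accessed address is a shift and a mask rather than a search over allocated objects, which is the simplification the caption alludes to.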