Learning Program Behavioral Models from Synthesized Input-Output Pairs

Tural Mammadov; Dietrich Klakow; Alexander Koller; Andreas Zeller

TL;DR

Modelizer tackles the challenge of modeling a black-box program’s behavior by learning a reversible, differentiable sequence-to-sequence model from input-output pairs generated via grammar-based synthesis. It separates structure from content using placeholders and structure-aware tokenizers, enabling compact, high-accuracy forward and inverse predictions with relatively few parameters. The approach demonstrates strong performance on real-world translators (e.g., Pandoc Markdown↔HTML) and several domain-specific conversion tasks, while highlighting limitations in long sequences and stateful behavior. By enabling program-specific behavior mocking, reverse prediction, and targeted test input generation, Modelizer offers a practical alternative to large language models for software engineering tasks and program understanding. Open-source availability and a modular pipeline further support reuse and extension to other programs and domains.
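The separation of structure from content can be illustrated with a minimal sketch. It assumes a simple regex-based HTML tokenizer and a hypothetical placeholder naming scheme (`TEXT_0`, `TEXT_1`, ...); neither is taken from Modelizer's implementation, and the helper names `mask_html` and `unmask` are invented for illustration.

```python
import re

# Assumption: tags are the "structure"; everything between tags is "content".
# The TEXT_<n> placeholder names are hypothetical, not Modelizer's vocabulary.
TAG = re.compile(r"(<[^>]+>)")

def mask_html(html):
    """Return (tokens, mapping): tags stay as tokens, text becomes placeholders."""
    tokens, mapping = [], {}
    for piece in TAG.split(html):
        if not piece:
            continue  # re.split yields empty strings between adjacent tags
        if TAG.fullmatch(piece):
            tokens.append(piece)  # structural token, kept verbatim
        else:
            name = f"TEXT_{len(mapping)}"
            mapping[name] = piece  # remember the content for reconstruction
            tokens.append(name)
    return tokens, mapping

def unmask(tokens, mapping):
    """Substitute the original content back for each placeholder."""
    return "".join(mapping.get(token, token) for token in tokens)

tokens, mapping = mask_html("<h1>Hello</h1><p>Some <em>text</em></p>")
print(tokens)
print(unmask(tokens, mapping))
```

Because the model would only ever see the tag tokens and a small set of placeholder names, its vocabulary stays small regardless of what text the documents contain, which is one plausible reason such models can remain compact.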

Abstract

We introduce Modelizer, a novel framework that, given a black-box program, learns a model from its input/output behavior using neural machine translation algorithms. The resulting model mocks the original program: given an input, the model predicts the output that the program would have produced. The model is also reversible; that is, it can predict the input that would have produced a given output. Finally, the model is differentiable and can be efficiently restricted to predict only a certain aspect of the program behavior. Modelizer uses grammars to synthesize inputs and unsupervised tokenizers to decompose the resulting outputs, allowing it to learn sequence-to-sequence associations between token streams. Other than input grammars, Modelizer only requires the ability to execute the program. The resulting models are small, requiring fewer than 6.3 million parameters for languages such as Markdown or HTML; and they are accurate, achieving up to 95.4% accuracy and a BLEU score of 0.98 with standard error 0.04 in mocking real-world applications. As it learns from and predicts executions rather than code, Modelizer departs from the LLM-centric research trend, opening new opportunities for program-specific models that are fully tuned towards individual programs. Indeed, we foresee several applications of these models, especially as the output of the program can be any aspect of program behavior. Beyond mocking and predicting program behavior, the models can also synthesize inputs that are likely to produce a particular behavior, such as failures or coverage, thus assisting in program understanding and maintenance.
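The data-collection step described in the abstract (synthesize inputs from a grammar, execute the program, record input/output pairs) can be sketched as follows. This is a minimal illustration under stated assumptions: the toy grammar covers only a tiny Markdown subset, `toy_program` is a stand-in function for the black-box program (in practice this would be an external tool such as Pandoc), and the names `GRAMMAR`, `generate`, and `collect_pairs` are hypothetical rather than part of Modelizer.

```python
import random

# Assumption: a toy context-free grammar; the first alternative of every
# rule is non-recursive so that generation can always be forced to stop.
GRAMMAR = {
    "<doc>": [["<block>"], ["<block>", "\n", "<doc>"]],
    "<block>": [["# ", "<text>"], ["<text>"]],
    "<text>": [["<word>"], ["<word>", " ", "<text>"]],
    "<word>": [["alpha"], ["beta"], ["gamma"]],
}

def generate(symbol, rng, depth=0, max_depth=6):
    """Expand a grammar symbol into a string by random derivation."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    # Beyond max_depth, take the first (non-recursive) alternative.
    expansion = alternatives[0] if depth >= max_depth else rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in expansion)

def toy_program(markdown):
    """Stand-in for the black-box program: toy Markdown subset to HTML."""
    out = []
    for line in markdown.split("\n"):
        if line.startswith("# "):
            out.append(f"<h1>{line[2:]}</h1>")
        else:
            out.append(f"<p>{line}</p>")
    return "".join(out)

def collect_pairs(n, seed=0):
    """Collect up to n unique (input, output) pairs by executing the program."""
    rng = random.Random(seed)
    pairs = {}
    attempts = 0
    while len(pairs) < n and attempts < 100 * n:
        attempts += 1
        source = generate("<doc>", rng)
        pairs.setdefault(source, toy_program(source))  # skip duplicate inputs
    return list(pairs.items())

for source, output in collect_pairs(5):
    print(repr(source), "->", repr(output))
```

The resulting pairs are what a sequence-to-sequence model would be trained on; swapping the two columns yields training data for the inverse direction, which is one simple way to picture the reversibility the abstract describes.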

Paper Structure (34 sections, 9 figures, 12 tables)

Introduction
Approach
Input Generation
Input Generation with Grammars
Placeholders
Dataset Pre-processing
Masked Tokenization
Model Learning
Model Deployment
Implementation
System Requirements
Tokenization with Placeholders
Abstract Tokenizers
A HTML Tokenizer
Token Masking Strategies
...and 19 more sections

Figures (9)

Figure 1: Example of Modelizer in action.
Figure 2: How Modelizer works. Modelizer tests the program with synthesized inputs and automatically learns a reversible program behavior model from extracted input-output pairs.
Figure 3: Input generation pipeline. Modelizer automatically synthesizes and validates unique inputs from a given specification. Input generation steps: (1) Synthesis, (2) Post-Processing, (3) Hashing, and (4) Validation.
Figure 4: Model Deployment. An example scenario of the behavior model predicting the program input given the program output. Phases: (1) Input tokenization, (2) Prediction generation, (3) Output reconstruction, (4) Prediction validation.
Figure 5: Markdown Sequence Length Frequency Distribution. The x-axis represents the length of different token sequences, and the y-axis denotes the frequency count of each token length. Every data split is represented by a different color.
...and 4 more figures
