
Synthetic Test Data Generation Using Recurrent Neural Networks: A Position Paper

Razieh Behjati, Erik Arisholm, Chao Tan, Margrethe M. Bedregal

TL;DR

Privacy constraints prevent the use of real production data for end-to-end testing of information systems, which motivates synthetic data generation as an alternative. The authors propose an end-to-end recurrent neural network generator that synthesizes dynamic meta-events and fully specified life-events, reducing reliance on extensive glue logic and leveraging the joint probability $p(x,y)$ and the posterior $p(y|x)$ for data sampling. Preliminary results show that the approach can reproduce near-original distributions (as measured by Jensen-Shannon divergence) and generate well-formed events. They also highlight limitations in learning complex rules (e.g., Modulo-11) and in extrapolating beyond the training horizon, which points future work toward more capable architectures (e.g., Bilateral-LSTM, conditional GANs) and larger, richer datasets. The work has practical significance for privacy-preserving, scalable test data generation in regulated, information-intensive domains, enabling more flexible and realistic integration testing without exposing real personal data.

Abstract

Testing in production-like test environments is an essential part of quality assurance processes in many industries. Provisioning such test environments for information-intensive services involves setting up databases that are rich enough to simulate a wide variety of user scenarios. While production data is perhaps the gold standard here, many organizations, particularly within the public sector, are not allowed to use production data for testing purposes due to privacy concerns. The alternatives are to use anonymized data or synthetically generated data. In this paper, we elaborate on these alternatives and compare them in an industrial context. Further, we focus on synthetic data generation and investigate the use of recurrent neural networks for this purpose. In our preliminary experiments, we were able to generate representative and highly accurate data using a recurrent neural network. These results open new research questions that we discuss here and plan to investigate in our future research.

Paper Structure (9 sections, 3 figures)

This paper contains 9 sections and 3 figures.

Figures (3)

  • Figure 1: Distribution of the sampled record lengths
  • Figure 2: Distribution of the event types. For clearer visualization, events coded as 91 and 92 are excluded from this diagram. These two codes have very high counts in both the original ($\sim 134\,000$) and sampled ($\sim 19\,000$) datasets. They are, however, included in the calculation of the Jensen-Shannon divergence reported in the text ($0.002\,087$). Excluding these event codes results in a Jensen-Shannon divergence of $0.028\,685$, which is still quite low but much higher than the value computed over all codes, indicating less similarity between the two distributions in the 1-90 range. This demonstrates how the data imbalance limits the model's ability to generate representative data for the less frequent classes. This degree of imbalance, however, is not present in the real Norwegian National Registry data. It appears in the dataset reported here because several load tests, generating thousands of code 91 and 92 events, were executed in the test environments from which we collected the data for this experiment.
  • Figure 3: Distribution of the time-stamps
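
The comparison described in the Figure 2 caption, a Jensen-Shannon divergence computed once over all event codes and once with the dominant codes 91 and 92 excluded, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy counts, and the choice of base-2 logarithms (which bounds the divergence to [0, 1]) are assumptions made for illustration; the source does not state which logarithm base was used.

```python
import math
from collections import Counter


def normalize(counts, keys):
    """Turn raw counts into a probability vector over the given keys."""
    total = sum(counts.get(k, 0) for k in keys)
    if total == 0:
        raise ValueError("no observations for the selected keys")
    return [counts.get(k, 0) / total for k in keys]


def js_divergence(counts_a, counts_b, exclude=()):
    """Jensen-Shannon divergence (base 2, an assumption) between two
    categorical distributions given as {category: count} mappings.
    Categories listed in `exclude` are dropped before normalizing."""
    keys = sorted((set(counts_a) | set(counts_b)) - set(exclude))
    p = normalize(counts_a, keys)
    q = normalize(counts_b, keys)
    # Mixture distribution M = (P + Q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i = 0 contribute 0.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy counts, invented for illustration only (not the paper's data):
# two dominant codes (91, 92) and a few rare codes.
original = Counter({1: 300, 2: 200, 3: 100, 91: 70000, 92: 64000})
sampled = Counter({1: 20, 2: 50, 3: 30, 91: 10000, 92: 9000})

jsd_all = js_divergence(original, sampled)
jsd_rare = js_divergence(original, sampled, exclude=(91, 92))

print(f"JSD over all codes:       {jsd_all:.6f}")
print(f"JSD without codes 91, 92: {jsd_rare:.6f}")
```

With these toy counts the divergence over all codes is much smaller than the divergence over the rare codes alone. This is the same qualitative effect the caption reports: when two classes dominate, their close agreement masks the poorer fit on the less frequent classes.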