Towards Synthetic Trace Generation of Modeling Operations using In-Context Learning Approach

Vittoriano Muttillo; Claudio Di Sipio; Riccardo Rubei; Luca Berardinelli; MohammadHadi Dehghani

TL;DR

The paper tackles the data scarcity and privacy constraints that hinder the training of intelligent modeling assistants (IMAs) for model-driven engineering. It proposes a conceptual MBSE framework that combines trace capture through the Modeling Event Recorder (MER), in-context-learning-based synthetic generation of modeling operations with four LLMs, and the MORGAN IMA for operation recommendations. It formalizes synthetic traces through constructs such as $\tau^{+}_{j}$ and $\tau^{+}(M^1_i)$ and evaluates trace realism with multiple distance-based metrics and a dedicated hallucination measure. Empirical results show that GPT-4 can produce traces that closely resemble human ones with minimal hallucination, though IMA accuracy is higher with real traces; mixing synthetic and real traces can mitigate data scarcity while maintaining performance. The work demonstrates that synthetic traces can bootstrap IMA training in industrial MBSE contexts and provides a replication-ready workflow for broader validation and extension.

Abstract

Producing accurate software models is crucial in model-driven software engineering (MDE). However, modeling complex systems is an error-prone task that requires deep application domain knowledge. In the past decade, several automated techniques have been proposed to support academic and industrial practitioners by providing relevant modeling operations. Nevertheless, those techniques require a huge amount of training data, which may be unavailable due to several factors, e.g., privacy issues. The advent of large language models (LLMs) can support the generation of synthetic data, although state-of-the-art approaches do not yet support the generation of modeling operations. To fill the gap, we propose a conceptual framework that combines modeling event logs, intelligent modeling assistants, and the generation of modeling operations using LLMs. In particular, the architecture comprises modeling components that help the designer specify the system, record its operation within a graphical modeling environment, and automatically recommend relevant operations. In addition, we generate a completely new dataset of modeling events by relying on the most prominent LLMs currently available. As a proof of concept, we instantiate the proposed framework using a set of existing modeling tools employed in industrial use cases within different European projects. To assess the proposed methodology, we first evaluate the capability of the examined LLMs to generate realistic modeling operations by relying on well-founded distance metrics. Then, we evaluate the recommended operations by considering real-world industrial modeling artifacts. Our findings demonstrate that LLMs can generate modeling events, even though the overall accuracy is higher when considering human-based operations.
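The first evaluation step, comparing LLM-generated traces against human ones, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the trace line format (`<EVENT_TYPE> <target>`), the event vocabulary `ALLOWED_EVENT_TYPES`, and the use of Python's `difflib` ratio as a stand-in similarity score. It does not reproduce the specific distance metrics or the hallucination measure used in the paper.

```python
from __future__ import annotations

from difflib import SequenceMatcher

# Hypothetical event vocabulary; the real MER trace schema is not specified here.
ALLOWED_EVENT_TYPES = {"ADD", "SET", "REMOVE"}


def parse_event(line: str) -> tuple[str, str]:
    """Split a trace line of the assumed form '<EVENT_TYPE> <target>'."""
    event_type, _, target = line.strip().partition(" ")
    return event_type, target


def trace_similarity(human: list[str], synthetic: list[str]) -> float:
    """Similarity in [0, 1] between two traces, compared line by line.

    Uses difflib's ratio (2 * matches / total lines) as a stand-in metric.
    """
    return SequenceMatcher(None, human, synthetic).ratio()


def hallucination_rate(synthetic: list[str]) -> float:
    """Fraction of synthetic events whose type is outside the assumed vocabulary."""
    if not synthetic:
        return 0.0
    unknown = sum(
        1 for line in synthetic if parse_event(line)[0] not in ALLOWED_EVENT_TYPES
    )
    return unknown / len(synthetic)


# Toy traces, invented for this example only.
human_trace = ["ADD Node a", "SET Node a.name", "ADD Link a-b"]
synthetic_trace = ["ADD Node a", "SET Node a.name", "CONNECT Link a-b", "ADD Link a-b"]

similarity = trace_similarity(human_trace, synthetic_trace)
rate = hallucination_rate(synthetic_trace)
```

In this toy case the synthetic trace contains one event type (`CONNECT`) that is not in the assumed vocabulary, so it counts as a hallucinated operation, while the remaining three lines match the human trace in order.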

Paper Structure (13 sections, 18 equations, 6 figures, 4 tables)


Introduction
Background and Related Work
Motivating example
Proposed approach
Framework components
Evaluation Materials and Methods
Employed tools
Datasets
Evaluating synthetic data
Evaluating modeling recommendations
Results
Threats To Validity
Conclusion

Figures (6)

Figure 1: HEPSYCODE Graphical Modeling Workbench (a) and trace file generated through the MER tool (b). The application considered in this scenario is called Digital Camera [muttillo2023].
Figure 2: The proposed approach.
Figure 3: Prompt schema and LLM answer example.
Figure 4: Modeling Event Recorder workflow.
Figure 5: Synthetic Data Quality Evaluation Results. The violin plots show the distribution of points with the scatter plot. The white dots in the center represent the median.
...and 1 more figures
