Exploring the Potential of Large Language Models in Simulink-Stateflow Mutant Generation

Pablo Valle; Shaukat Ali; Aitor Arrieta

TL;DR

This work addresses the challenge of mutation testing for Simulink-Stateflow models by using Large Language Models (LLMs) to generate domain-specific mutants. It introduces an automated JSON-based pipeline, two mutation strategies (global and local), and few-shot prompting to guide eight state-of-the-art LLMs, and it evaluates 38,400 mutants across four case-study models. Results show that LLMs generate mutants up to 13x faster than a handcrafted baseline, with higher mutant quality and lower duplication and equivalence rates, although the baseline guarantees higher generability and compilability. The study provides practical insights into prompt design and temperature settings, highlights common failure modes, and releases an open-source prototype and dataset to support future research in LLM-assisted mutation testing for Cyber-Physical Systems (CPSs).

Abstract

Mutation analysis is a powerful technique for assessing test-suite adequacy, yet conventional approaches often generate redundant, equivalent, or non-executable mutants. These challenges are amplified in Simulink-Stateflow models, which are hierarchically structured, integrate continuous dynamics with discrete-event behaviors, and are widely deployed in safety-critical Cyber-Physical Systems (CPSs). While prior work has explored machine learning and manually engineered mutation operators, these approaches remain constrained by limited training data and scalability issues. Motivated by recent advances in Large Language Models (LLMs), we investigate their potential to generate high-quality, domain-specific mutants for Simulink-Stateflow models. We develop an automated pipeline that converts Simulink-Stateflow models to structured JSON representations and systematically evaluates different mutation and prompting strategies across eight state-of-the-art LLMs. Through a comprehensive empirical study involving 38,400 LLM-generated mutants across four Simulink-Stateflow models, we demonstrate that LLMs generate mutants up to 13x faster than a manually engineered mutation-based baseline while producing significantly fewer equivalent and duplicate mutants and consistently achieving superior mutant quality. Moreover, our analysis reveals that few-shot prompting combined with low-to-medium temperature values yields the best results. We provide an open-source prototype tool and release our complete dataset to facilitate reproducibility and advance future research in this domain.
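To make the JSON-based, few-shot workflow described above more concrete, the following is a minimal sketch and not the authors' tool. It assumes a Stateflow transition has already been exported as a small JSON object; the field names (`source`, `target`, `condition`), the prompt wording, and the helper names `canonical`, `build_few_shot_prompt`, and `filter_mutants` are all hypothetical. The sketch only covers prompt assembly and a purely syntactic filter for unparseable, unchanged, or duplicate replies; detecting behaviorally equivalent mutants would require simulating the model.

```python
import json


def canonical(obj):
    """Serialize with sorted keys so structurally equal JSON compares equal."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def build_few_shot_prompt(element, examples):
    """Assemble a few-shot prompt: instruction, worked examples, then the target.

    `examples` is a list of (original, mutant) pairs; the wording is hypothetical.
    """
    parts = ["Mutate the following Stateflow element. Reply with JSON only."]
    for original, mutant in examples:
        parts.append("Original: " + canonical(original))
        parts.append("Mutant: " + canonical(mutant))
    parts.append("Original: " + canonical(element))
    parts.append("Mutant:")
    return "\n".join(parts)


def filter_mutants(element, replies):
    """Keep replies that parse as JSON, differ from the original, and are new."""
    seen = {canonical(element)}  # the unchanged original is not a mutant
    kept = []
    for reply in replies:
        try:
            candidate = json.loads(reply)
        except json.JSONDecodeError:
            continue  # unparseable reply: discard
        key = canonical(candidate)
        if key in seen:
            continue  # identical to the original or to an earlier mutant
        seen.add(key)
        kept.append(candidate)
    return kept
```

In such a setup, the string returned by `build_few_shot_prompt` would be sent to an LLM at a chosen temperature, and the raw replies would be passed through `filter_mutants` before any attempt to rebuild and compile the mutated model.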
