Leveraging LLMs to support co-evolution between definitions and instances of textual DSLs: A Systematic Evaluation
Weixing Zhang, Bowen Jiang, Yuhong Fu, Anne Koziolek, Regina Hebig, Daniel Strüber
TL;DR
The paper addresses co-evolving grammar definitions and textual DSL instances using LLMs to preserve human-oriented information. It introduces a three-fold methodology combining case language selection, grammar change characterization, and an LLM-driven co-evolution workflow implemented via Python scripts and a Domainmodel-inspired prompt. Across ten real-world case languages, the study reports high correctness and information preservation for small-scale evolutions, with Claude outperforming GPT, especially on larger, more complex changes, and notes the need for prompt- and model-specific tuning for scalability. The findings offer practical guidance on when LLM-based co-evolution is viable and point to hybrid or incremental approaches for handling complex grammar evolutions in real-world DSL engineering.
Abstract
Software languages evolve over time for reasons such as feature additions. When grammars evolve, textual instances that originally conformed to them may become outdated. While model-driven engineering provides many techniques for co-evolving models with metamodel changes, these approaches are not designed for textual DSLs and may lose human-relevant information such as layout and comments. This study systematically evaluates the potential of large language models (LLMs) for co-evolving grammars and instances of textual DSLs. Using Claude Sonnet 4.5 and GPT-5.2 across ten case languages with ten runs each, we assess both correctness and preservation of human-oriented information. Results show strong performance on small-scale cases ($\geq$94% precision and recall for instances requiring fewer than 20 modified lines), but performance degrades with scale: Claude maintains 85% recall at 40 lines, while GPT fails on the largest instances. Response time increases substantially with instance size, and grammar evolution complexity and deletion granularity affect performance more than change type. These findings clarify when LLM-based co-evolution is effective and where current limitations remain.
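The abstract reports precision and recall over modified lines but does not define how they are computed. As an illustration only, the following minimal Python sketch shows one plausible line-level reading: the edits needed to reach a reference co-evolved instance are compared with the edits the LLM actually made. The function names `changed_lines` and `precision_recall`, the use of `difflib`, and the sample DSL snippet are assumptions introduced here, not the authors' implementation.

```python
import difflib
from collections import Counter


def changed_lines(before: str, after: str) -> Counter:
    """Multiset of ('-', line) and ('+', line) edits that turn `before` into `after`."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff prefixes each line with '- ', '+ ', '  ' or '? '; keep only real edits.
    return Counter((d[0], d[2:]) for d in diff if d[0] in "+-")


def precision_recall(original: str, expected: str, produced: str) -> tuple[float, float]:
    """Compare the LLM's line edits with the reference edits (hypothetical metric)."""
    gold = changed_lines(original, expected)   # edits a correct co-evolution requires
    pred = changed_lines(original, produced)   # edits the LLM actually made
    hits = sum((gold & pred).values())         # edits present in both multisets
    precision = hits / sum(pred.values()) if pred else 1.0
    recall = hits / sum(gold.values()) if gold else 1.0
    return precision, recall


# Hypothetical instance: the grammar renames the keyword `entity` to `class`.
original = "entity A {\n  name: String\n}\n// keep me"
expected = "class A {\n  name: String\n}\n// keep me"
# The LLM applies the rename but also drops the trailing comment.
produced = "class A {\n  name: String\n}"

p, r = precision_recall(original, expected, produced)
```

Under this reading, an output that applies every required change but also deletes a comment keeps full recall while losing precision, which is one way the loss of human-oriented information could show up in the scores.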
