Leveraging LLMs to support co-evolution between definitions and instances of textual DSLs: A Systematic Evaluation

Weixing Zhang; Bowen Jiang; Yuhong Fu; Anne Koziolek; Regina Hebig; Daniel Strüber

TL;DR

The paper addresses co-evolving grammar definitions and textual DSL instances using LLMs to preserve human-oriented information. It introduces a three-fold methodology combining case language selection, grammar change characterization, and an LLM-driven co-evolution workflow implemented via Python scripts and a Domainmodel-inspired prompt. Across ten real-world case languages, the study reports high correctness and information preservation for small-scale evolutions, with Claude outperforming GPT, especially on larger, more complex changes, and notes the need for prompt- and model-specific tuning for scalability. The findings offer practical guidance on when LLM-based co-evolution is viable and point to hybrid or incremental approaches for handling complex grammar evolutions in real-world DSL engineering.

Abstract

Software languages evolve over time for reasons such as feature additions. When grammars evolve, textual instances that originally conformed to them may become outdated. While model-driven engineering provides many techniques for co-evolving models with metamodel changes, these approaches are not designed for textual DSLs and may lose human-relevant information such as layout and comments. This study systematically evaluates the potential of large language models (LLMs) for co-evolving grammars and instances of textual DSLs. Using Claude Sonnet 4.5 and GPT-5.2 across ten case languages with ten runs each, we assess both correctness and preservation of human-oriented information. Results show strong performance on small-scale cases ($\geq$94% precision and recall for instances requiring fewer than 20 modified lines), but performance degrades with scale: Claude maintains 85% recall at 40 lines, while GPT fails on the largest instances. Response time increases substantially with instance size, and grammar evolution complexity and deletion granularity affect performance more than change type. These findings clarify when LLM-based co-evolution is effective and where current limitations remain.
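To make the notion of precision and recall over modified lines concrete, the following is a minimal sketch of one way such scores could be computed. It is an assumption for illustration, not the paper's actual metric definition or script: the function names `line_edits` and `precision_recall`, the edit representation, and the toy DSL snippet are all hypothetical. An edit is taken to be a changed span of the original instance together with its replacement text, and the edits made by the LLM are compared against those of a reference co-evolved instance.

```python
from difflib import SequenceMatcher


def line_edits(original: str, revised: str) -> set:
    """Return the set of line-level edits that turn `original` into `revised`.

    Each edit is (start, end, new_lines): the span of original lines it
    touches and the text that replaces it. This representation is an
    illustrative assumption.
    """
    a, b = original.splitlines(), revised.splitlines()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    edits = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.add((i1, i2, tuple(b[j1:j2])))
    return edits


def precision_recall(original: str, expected: str, produced: str):
    """Compare the produced edits against the expected (reference) edits."""
    gold = line_edits(original, expected)
    got = line_edits(original, produced)
    hits = len(gold & got)
    # With no edits on one side, the corresponding score is defined as 1.0.
    precision = hits / len(got) if got else 1.0
    recall = hits / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical instance: the grammar renames `entity` to `class` and
# `Int` to `Integer`; the comment line must survive untouched.
ORIGINAL = "// people model\nentity Person {\n  name: String\n  age: Int\n}"
EXPECTED = "// people model\nclass Person {\n  name: String\n  age: Integer\n}"
# The produced instance applies only the first of the two required changes.
PRODUCED = "// people model\nclass Person {\n  name: String\n  age: Int\n}"

precision, recall = precision_recall(ORIGINAL, EXPECTED, PRODUCED)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every edit the model made is correct, but one required edit is missing, so precision stays high while recall drops. Comparing edits rather than whole files also means untouched comments and layout do not inflate the scores.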

Paper Structure (43 sections, 2 figures, 10 tables)

Introduction
Problem Description and Motivation
Background
Language Evolution
Xtext-based DSLs
Large Language Models
Methodology
Case Language Selection
Characterization of Grammar Changes
Solution Design
Python Script Development
Prompt Optimization
Evaluation Metrics
Correctness Metrics
Human-Oriented Information Preservation Metrics
...and 28 more sections

Figures (2)

Figure 1: When the grammar evolves, textual instances that originally adhered to it may no longer conform to it, and so need to be co-evolved to conform to it.
Figure 2: Three-step research methodology.
