Data Dependency-Aware Code Generation from Enhanced UML Sequence Diagrams

Wenxin Mao, Zhitao Wang, Long Wang, Sirong Chen, Cuiyun Gao, Luyang Cao, Ziming Liu, Qiming Zhang, Jun Zhou, Zhi Jin

TL;DR

We address the problem of producing reliable code from complex software designs by decoupling data dependencies from control flow through a dedicated data dependency inference (DDI) step. The proposed UML2Dep framework enhances UML sequence diagrams with decision tables and refined API specifications, pairs them with mathematical formalization prompting, and applies reachability-based context pruning to shrink the context the LLM must reason over. Across industrial datasets, UML2Dep achieves an average DDI recall of 89.97%, precision of 95.06%, and F1 of 92.33%. It also improves downstream code quality, increasing the compilation pass rate by 8.83% and the full unit test pass rate by 11.66%. The evaluation uses real-world microservice designs and reports benefits for design validation, code synthesis reliability, and integration into industrial pipelines.
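
As a quick consistency check on the reported metrics, the F1 score of a single precision/recall pair is their harmonic mean. Plugging in the averaged precision and recall gives a value close to, but not identical to, the reported 92.33%. One possible explanation (an assumption, not stated in the source) is that F1 is averaged per dataset rather than computed from the averaged precision and recall.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 95.06 \times 89.97}{95.06 + 89.97}
    \approx 92.44
```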

Abstract

Large language models (LLMs) excel at generating code from natural language (NL) descriptions. However, plain textual descriptions are inherently ambiguous and often fail to capture complex requirements such as intricate system behaviors, conditional logic, and architectural constraints. In particular, implicit data dependencies in service-oriented architectures are difficult to infer and handle correctly. To bridge this gap, we propose UML2Dep, a novel step-by-step code generation framework that leverages unambiguous formal specifications of complex requirements. First, we introduce an enhanced Unified Modeling Language (UML) sequence diagram tailored for service-oriented architectures. This diagram extends the traditional visual syntax with decision tables and API specifications, explicitly formalizing the structural relationships and business logic flows in service interactions to eliminate linguistic ambiguity. Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference (DDI) task, which constructs an explicit data dependency graph before any code is synthesized. To ensure reliability, we formalize DDI as a constrained mathematical reasoning task through novel prompting strategies, which plays to LLMs' strengths in mathematical reasoning. Additional static parsing and dependency pruning further reduce the context complexity and cognitive load associated with intricate specifications, thereby improving reasoning accuracy and efficiency.
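
The idea of pruning context by reachability before inferring data dependencies can be illustrated with a minimal sketch. This is not the authors' implementation: the step names, the `produces`/`consumes` maps, and both function names are hypothetical, and the execution graph is assumed to be a plain list of directed edges between steps. The sketch keeps only the steps that can execute before a target step, then lists which of those steps could supply each field the target consumes.

```python
from collections import deque


def reachable_predecessors(edges, target):
    """Return every step from which `target` can be reached along `edges`."""
    # Build a reverse adjacency list so we can walk backwards from the target.
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, []).append(src)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for pred in reverse.get(node, []):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    # If the graph has a cycle through the target, do not report it as its own predecessor.
    seen.discard(target)
    return seen


def candidate_dependencies(edges, produces, consumes, target):
    """Map each field consumed by `target` to the reachable steps that produce it."""
    preds = reachable_predecessors(edges, target)
    deps = {}
    for field in consumes.get(target, []):
        deps[field] = sorted(p for p in preds if field in produces.get(p, ()))
    return deps


# Hypothetical online-shopping flow with two mutually exclusive branches
# after the stock check (all names are illustrative assumptions).
edges = [
    ("get_user", "check_stock"),
    ("check_stock", "create_order"),
    ("check_stock", "notify_out_of_stock"),
    ("create_order", "charge_payment"),
]
produces = {
    "get_user": {"user_id"},
    "check_stock": {"sku"},
    "create_order": {"order_id"},
    "notify_out_of_stock": {"notice_id"},
}
consumes = {"charge_payment": ["user_id", "order_id"]}

print(sorted(reachable_predecessors(edges, "charge_payment")))
print(candidate_dependencies(edges, produces, consumes, "charge_payment"))
```

In this toy flow, `notify_out_of_stock` sits on a branch that never leads to `charge_payment`, so it is dropped from the context before any dependency is considered. That is the sense in which pruning reduces what a model would have to reason over.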

Paper Structure

This paper contains 28 sections, 6 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the DDI task and our framework
  • Figure 2: Structured prompt template for DDI mathematical formalization
  • Figure 3: EDG Construction and Reachable Predecessors Identification on EDG
  • Figure 4: Comparison of DDI task results with and without Reachability-Based Context Pruning (RBCP)
  • Figure 5: Example Sequence Diagram for Online Shopping
  • ...and 1 more figures

Theorems & Definitions (8)

  • Definition 1: Data Dependency Inference Problem
  • Definition 2: Data Dependency Graph
  • Definition 3: Data Dependency Node
  • Definition 4: Data Dependency Edge
  • Definition 5: Data Consumption and Production Categories
  • Definition 6: Execution Reachability
  • Definition 7: Context Pruning Problem
  • Definition 8: EDG