Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Development
Xin Peng, Chong Wang
TL;DR
The paper addresses the gap between current LLM capabilities and the realities of enterprise software development, where tacit knowledge and long-term evolution play critical roles. It proposes the Code Digital Twin, a living knowledge framework that jointly models software artifacts and high-level conceptual knowledge, and co-evolves with the codebase through hybrid representations, multi-stage extraction, and human-in-the-loop feedback. The authors outline a concrete methodology and roadmap, including knowledge representation, a construction pipeline, co-evolution mechanisms, LLM-powered applications, and evaluation via preliminary case studies in issue localization and Android application generation. Early results suggest that structuring knowledge around concepts and functionalities enables more accurate reasoning and safer, more coherent development across large, complex systems. If realized at scale, the Code Digital Twin could bridge AI capabilities and enterprise realities, enabling sustainable, context-aware software evolution for ultra-complex systems.
Abstract
Recent advances in large language models (LLMs) have demonstrated strong capabilities in software engineering tasks, raising expectations of revolutionary productivity gains. However, enterprise software development is largely driven by incremental evolution, where challenges extend far beyond routine coding and depend critically on tacit knowledge, including design decisions at different levels and historical trade-offs. To achieve effective AI-powered support for complex software development, we should align emerging AI capabilities with the practical realities of enterprise development. To this end, we systematically identify challenges from both software and LLM perspectives. Alongside these challenges, we outline opportunities where AI and structured knowledge frameworks can enhance decision-making in tasks such as issue localization and impact analysis. To address these needs, we propose the Code Digital Twin, a living framework that models both the physical and conceptual layers of software, preserves tacit knowledge, and co-evolves with the codebase. By integrating hybrid knowledge representations, multi-stage extraction pipelines, incremental updates, LLM-empowered applications, and human-in-the-loop feedback, the Code Digital Twin transforms fragmented knowledge into explicit and actionable representations. Our vision positions it as a bridge between AI advancements and enterprise software realities, providing a concrete roadmap toward sustainable, intelligent, and resilient development and evolution of ultra-complex systems.
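To make the idea of a two-layer, co-evolving model more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the names Artifact, Concept, TwinSketch, link, concepts_for, and on_artifact_removed are all hypothetical, and the abstract does not specify any schema or API. The sketch only shows how artifacts (the physical layer) might be linked to concepts carrying a recorded rationale (the conceptual layer), and how a code change could flag concepts for human review.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Physical layer: a concrete software artifact (hypothetical schema)."""
    path: str
    kind: str  # e.g. "file" or "function"


@dataclass
class Concept:
    """Conceptual layer: a functionality or design idea with its rationale."""
    name: str
    rationale: str = ""
    artifacts: set[str] = field(default_factory=set)


class TwinSketch:
    """Toy store linking the two layers; not the paper's actual framework."""

    def __init__(self) -> None:
        self.artifacts: dict[str, Artifact] = {}
        self.concepts: dict[str, Concept] = {}

    def add_artifact(self, artifact: Artifact) -> None:
        self.artifacts[artifact.path] = artifact

    def link(self, concept_name: str, artifact_path: str, rationale: str = "") -> None:
        """Attach an artifact to a concept, creating the concept if needed."""
        if artifact_path not in self.artifacts:
            raise KeyError(f"unknown artifact: {artifact_path}")
        concept = self.concepts.setdefault(concept_name, Concept(concept_name))
        concept.artifacts.add(artifact_path)
        if rationale:
            concept.rationale = rationale

    def concepts_for(self, artifact_path: str) -> list[str]:
        """Return the names of concepts linked to an artifact, sorted."""
        return sorted(
            c.name for c in self.concepts.values() if artifact_path in c.artifacts
        )

    def on_artifact_removed(self, artifact_path: str) -> list[str]:
        """Incremental update: drop the artifact and its links, and return
        the affected concept names so a human can review them."""
        affected = self.concepts_for(artifact_path)
        for name in affected:
            self.concepts[name].artifacts.discard(artifact_path)
        self.artifacts.pop(artifact_path, None)
        return affected


if __name__ == "__main__":
    twin = TwinSketch()
    twin.add_artifact(Artifact("auth/login.py", "file"))
    twin.link("User authentication", "auth/login.py", "Sessions chosen over tokens")
    print(twin.concepts_for("auth/login.py"))
    print(twin.on_artifact_removed("auth/login.py"))
```

In this sketch the rationale field stands in for tacit knowledge such as a historical trade-off, and the list returned by on_artifact_removed stands in for the human-in-the-loop step: the concepts it names are the ones whose recorded knowledge may have gone stale after the change.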
