Semantic Code Graph -- an information model to facilitate software comprehension
Krzysztof Borowski, Bartosz Baliś, Tomasz Orzechowski
TL;DR
The paper tackles software comprehension in large codebases by introducing the Semantic Code Graph (SCG), a close-to-source-code information model that captures diverse code dependencies with precise location data. It formalizes the SCG, provides language-specific implementations for Java and Scala, and offers a protobuf-based storage format and a scalable extraction pipeline that also handles bytecode-related challenges. In an empirical study on eleven Java/Scala projects, the SCG is compared against the Call Graph (CG) and the Class Collaboration Network (CCN) and yields richer, more actionable insights for project structure comprehension, identification of critical entities, interactive visualization, and software mining. The authors provide open tools (scg-cli) and data (protobuf SCG graphs) to support reproducibility and practical adoption, arguing that the SCG enables more effective refactoring, visualization, and maintenance planning. Overall, the SCG is presented as an extensible foundation that unifies multiple perspectives on software dependencies and integrates with external analysis workflows, with the potential to reduce maintenance costs and speed up comprehension.
Abstract
Software comprehension can be extremely time-consuming due to the ever-growing size of codebases. Consequently, there is an increasing need to accelerate the code comprehension process to facilitate maintenance and reduce associated costs. A crucial aspect of this process is understanding the code dependency structure and keeping its quality high. While a variety of code structure models already exist, there is a surprising lack of models that closely represent the source code and focus on software comprehension. As a result, there are no readily available and easy-to-use tools to assist with dependency comprehension, refactoring, and quality monitoring of code. To address this gap, we propose the Semantic Code Graph (SCG), an information model that offers a detailed abstract representation of code dependencies with a close relationship to the source code. To validate the SCG model's usefulness in software comprehension, we compare it to nine other source code representation models. Additionally, we select 11 well-known and widely used open-source projects developed in Java and Scala and perform a range of software comprehension activities on them using three different code representation models: the proposed SCG, the Call Graph (CG), and the Class Collaboration Network (CCN). We then qualitatively analyze the results to compare the performance of these models in terms of software comprehension capabilities. These activities encompass project structure comprehension, identifying critical project entities, interactive visualization of code dependencies, and uncovering code similarities through software mining. Our findings demonstrate that the SCG enhances software comprehension capabilities compared to the prevailing CCN and CG models. We believe that the work described is a step towards the next generation of tools that streamline code dependency comprehension and management.
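As a rough illustration of what a graph of typed dependencies with source locations might look like, the following minimal Python sketch models entities and dependencies, derives a call-graph-like view by filtering edge kinds, and ranks entities by incoming dependencies. Everything here is an assumption made for illustration: the class names, field names, edge labels, and the use of in-degree as a criticality proxy are hypothetical and do not reproduce the paper's formal definition, its protobuf schema, or the scg-cli tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Source position of a declaration or reference (hypothetical fields)."""
    file: str
    line: int
    column: int


@dataclass(frozen=True)
class Node:
    """A code entity such as a class or method."""
    id: str
    kind: str  # illustrative labels, e.g. "CLASS", "METHOD"
    location: Location


@dataclass(frozen=True)
class Edge:
    """A typed dependency between two entities, anchored at a source position."""
    src: str
    dst: str
    kind: str  # illustrative labels, e.g. "CALL", "EXTEND"
    location: Location


def project(edges, kinds):
    """Keep only edges of the given kinds, e.g. {"CALL"} for a call-graph-like view."""
    return [e for e in edges if e.kind in kinds]


def rank_by_in_degree(edges):
    """Rank entities by number of incoming dependencies (an assumed criticality proxy)."""
    return Counter(e.dst for e in edges).most_common()


# Hypothetical example data; entity names and positions are made up.
nodes = [
    Node("Repo", "CLASS", Location("Repo.java", 3, 14)),
    Node("Repo.find", "METHOD", Location("Repo.java", 8, 19)),
]
edges = [
    Edge("Service.run", "Repo.find", "CALL", Location("Service.java", 12, 9)),
    Edge("Controller.get", "Repo.find", "CALL", Location("Controller.java", 30, 16)),
    Edge("Controller.get", "Service.run", "CALL", Location("Controller.java", 28, 16)),
    Edge("CachedRepo", "Repo", "EXTEND", Location("CachedRepo.java", 5, 32)),
]

call_view = project(edges, {"CALL"})
ranking = rank_by_in_degree(call_view)
print(ranking[0])
```

Because each edge carries its kind and location, narrower views such as a call graph can be obtained by filtering rather than by building a separate model, and any result can be traced back to a file and line.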
