Comparison of Static Analysis Architecture Recovery Tools for Microservice Applications
Simon Schneider, Alexander Bakhtin, Xiaozhou Li, Jacopo Soldani, Antonio Brogi, Tomas Cerny, Riccardo Scandariato, Davide Taibi
TL;DR
The paper surveys static architecture recovery tools for microservice applications, identifying 13 tools and empirically comparing nine on a common dataset extended with 63 endpoints. It quantifies extraction accuracy via precision, recall, and F1, revealing that Code2DFD and MicroDepGraph excel for components, while Code2DFD and RAD-family tools perform well on endpoints, with combinations achieving up to 0.91 F1. The study highlights reproducibility challenges, emphasizes that deployment-file parsing yields fast, high-precision results, and argues that deeper code analysis improves recall, suggesting near-perfect extraction is feasible with mature tool integration. Practically, the results guide practitioners in selecting tools and inform researchers about promising directions for more accurate and scalable microservice architecture recovery.
Abstract
Architecture recovery tools help software engineers obtain an overview of the structure of their software systems during all phases of the software development life cycle. This is especially important for microservice applications because they consist of multiple interacting microservices, which makes it more challenging to oversee the architecture. Various tools and techniques for architecture recovery (also called architecture reconstruction) have been presented in academic and gray literature sources, but no overview or comparison of their accuracy exists. This paper presents the results of a multivocal literature review that identifies architecture recovery tools for microservice applications, together with a comparison of the identified tools' architecture recovery accuracy. We focused on static tools since they can be integrated into fast-paced CI/CD pipelines. Thirteen such tools were identified from the literature, and nine of them could be executed and compared on their ability to detect different system characteristics. The best-performing tool exhibited an overall F1-score of 0.86. Additionally, the possibility of combining multiple tools to increase recovery correctness was investigated, yielding a combination of four individual tools that achieves an F1-score of 0.91. Registered report: The methodology of this study has been peer-reviewed and accepted as a registered report at MSR'24: arXiv:2403.06941.
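To make the reported metrics concrete, the following minimal Python sketch shows how precision, recall, and F1 can be computed for one tool's extracted items against a ground truth, and how combining tools might affect the score. It assumes that a combination is the plain union of the individual tools' outputs; the study's actual combination strategy may differ. The function names (`precision_recall_f1`, `combine_by_union`) and the toy service names are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Iterable, Set, Tuple


def precision_recall_f1(extracted: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Score a set of extracted items (e.g. component names) against a ground truth."""
    tp = len(extracted & truth)   # items correctly extracted
    fp = len(extracted - truth)   # items extracted but not in the ground truth
    fn = len(truth - extracted)   # ground-truth items the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def combine_by_union(outputs: Iterable[Set[str]]) -> Set[str]:
    """Assumed combination strategy: keep every item reported by at least one tool."""
    combined: Set[str] = set()
    for output in outputs:
        combined |= output
    return combined


# Illustrative toy data only; not results from the study.
truth = {"gateway", "orders", "payments", "users"}
tool_a = {"gateway", "orders"}             # precise, but misses two services
tool_b = {"payments", "users", "cache"}    # finds the rest, plus one false positive

for name, output in [("tool_a", tool_a), ("tool_b", tool_b),
                     ("union", combine_by_union([tool_a, tool_b]))]:
    p, r, f1 = precision_recall_f1(output, truth)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy setting the union reaches full recall and inherits the single false positive from `tool_b`. That is the trade-off a union-based combination faces: recall can only grow, while precision may fall.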
