AndroLibZoo: A Reliable Dataset of Libraries Based on Software Dependency Analysis
Jordan Samhi, Tegawendé F. Bissyandé, Jacques Klein
TL;DR
The paper tackles the challenge of distinguishing library code from developer code in Android apps to improve static analysis scalability and accuracy. It presents AndroLibZoo, an automated, up-to-date dataset of third-party libraries built by mining Maven, Google Maven, and open-source Android projects, followed by a refinement step to ensure library-only results. The dataset contains 34,813 package names and is designed to evolve, with artifacts publicly available to support research and practice. The work demonstrates that a comprehensive, construction-based library whitelist can aid static analyzers in reducing noise and improving precision, while outlining limitations and future directions for integration and enhancement.
Abstract
Android app developers extensively employ code reuse, integrating many third-party libraries into their apps. While such integration is practical for developers, it makes it challenging for static analyzers to achieve scalability and precision when libraries account for a large part of the code. As a direct consequence, it is common practice in the literature to consider only developer code during static analysis, on the assumption that the sought issues are in developer code rather than in the libraries. Analysts therefore need to distinguish between library and developer code. Currently, many static analyses rely on white lists of libraries, but these white lists are unreliable, inaccurate, and largely non-comprehensive. In this paper, we propose a new approach to address the lack of comprehensive and automated solutions for producing accurate and "always up to date" sets of libraries. First, we demonstrate the continued need for a white list of libraries. Second, we propose an automated approach to produce an accurate and up-to-date set of third-party libraries in the form of a dataset called AndroLibZoo. Our dataset, which we make available to the community, contains to date 34,813 libraries and is meant to evolve.
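
To make concrete how a white list of library package names can separate library code from developer code, the following is a minimal sketch, not the authors' implementation. It assumes the white list is available as a set of package names and that classes are given as fully qualified names. The function names `is_library_class` and `split_classes`, and the sample package names, are hypothetical and chosen for illustration only.

```python
def is_library_class(class_name, library_packages):
    """Return True if any package prefix of class_name is in the white list.

    Matching is done on whole package segments, so "com.google.gson"
    does not match a class in "com.google.gsonextras".
    """
    parts = class_name.split(".")
    # Try every package prefix of the class, longest first,
    # excluding the final segment (the class name itself).
    for i in range(len(parts) - 1, 0, -1):
        if ".".join(parts[:i]) in library_packages:
            return True
    return False


def split_classes(class_names, library_packages):
    """Partition class names into (developer, library) lists."""
    developer, library = [], []
    for name in class_names:
        if is_library_class(name, library_packages):
            library.append(name)
        else:
            developer.append(name)
    return developer, library


# Hypothetical white list and app classes, for illustration only.
white_list = {"okhttp3", "com.google.gson"}
app_classes = [
    "okhttp3.OkHttpClient",
    "com.google.gson.internal.Streams",
    "com.google.gsonextras.Foo",
    "com.example.app.MainActivity",
]
developer_code, library_code = split_classes(app_classes, white_list)
```

A static analyzer could then restrict its analysis to the classes in `developer_code`. Matching on whole package segments rather than raw string prefixes is a design choice of this sketch to avoid accidental matches between packages whose names merely share leading characters.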
