AndroLibZoo: A Reliable Dataset of Libraries Based on Software Dependency Analysis

Jordan Samhi, Tegawendé F. Bissyandé, Jacques Klein

TL;DR

The paper tackles the challenge of distinguishing library code from developer code in Android apps to improve static analysis scalability and accuracy. It presents AndroLibZoo, an automated, up-to-date dataset of third-party libraries built by mining Maven, Google Maven, and open-source Android projects, followed by a refinement step to ensure library-only results. The dataset contains 34,813 package names and is designed to evolve, with artifacts publicly available to support research and practice. The work demonstrates that a comprehensive, construction-based library whitelist can aid static analyzers in reducing noise and improving precision, while outlining limitations and future directions for integration and enhancement.

Abstract

Android app developers extensively employ code reuse, integrating many third-party libraries into their apps. While such integration is practical for developers, it makes it challenging for static analyzers to achieve scalability and precision when libraries account for a large part of the code. As a direct consequence, it is common practice in the literature to consider developer code only during static analysis, on the assumption that the sought issues are in developer code rather than in the libraries. Analysts therefore need to distinguish between library and developer code. Currently, many static analyses rely on white lists of libraries, but these white lists are unreliable, inaccurate, and largely non-comprehensive. In this paper, we propose a new approach to address the lack of comprehensive and automated solutions for the production of accurate and "always up to date" sets of libraries. First, we demonstrate the continued need for a white list of libraries. Second, we propose an automated approach to produce an accurate and up-to-date set of third-party libraries in the form of a dataset called AndroLibZoo. Our dataset, which we make available to the community, contains to date 34,813 libraries and is meant to evolve.
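To illustrate how a static analyzer might consume such a white list, here is a minimal sketch that splits an app's fully qualified class names (FQCNs) into developer code and library code by package-prefix matching. It is not the authors' tooling: the function names, the in-memory set representation, and the sample package names are all assumptions made for illustration, and the actual distribution format of AndroLibZoo is not described in this section.

```python
from typing import Iterable, List, Set, Tuple


def is_library_class(fqcn: str, library_packages: Set[str]) -> bool:
    """Return True if any package prefix of `fqcn` appears in the white list."""
    parts = fqcn.split(".")[:-1]  # drop the simple class name, keep the package
    # Test every prefix ("a", "a.b", "a.b.c", ...) so that sub-packages of a
    # listed library also match, while "com.googlex" does not match "com.google".
    for end in range(1, len(parts) + 1):
        if ".".join(parts[:end]) in library_packages:
            return True
    return False


def split_classes(
    fqcns: Iterable[str], library_packages: Set[str]
) -> Tuple[List[str], List[str]]:
    """Partition class names into (developer, library) lists, preserving order."""
    developer: List[str] = []
    library: List[str] = []
    for fqcn in fqcns:
        if is_library_class(fqcn, library_packages):
            library.append(fqcn)
        else:
            developer.append(fqcn)
    return developer, library


# Hypothetical white list and app classes, for illustration only.
whitelist = {"okhttp3", "com.google.gson"}
app_classes = [
    "okhttp3.OkHttpClient",
    "com.google.gson.Gson",
    "com.example.myapp.MainActivity",
]
developer, library = split_classes(app_classes, whitelist)
```

Matching on whole package segments rather than raw string prefixes avoids misclassifying a developer package that merely shares leading characters with a library package. A set of this size (tens of thousands of entries) keeps each lookup constant-time, so the filter adds little overhead before the analysis proper.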

Paper Structure (12 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Proportion of FQCNs within the apps' package and obfuscated FQCNs
  • Figure 2: Overview of our methodology to construct AndroLibZoo, a collection of third-party libraries.
  • Figure 3: Dependency tree of the com.android.tools.sdk-common library version 22.9.0.