Android App Feature Extraction: A review of approaches for malware and app similarity detection
Simon Torka, Sahin Albayrak
TL;DR
The paper surveys the literature on Android malware, clone, and functional similarity detection from 2002 to 2022. It catalogs data sources, feature extraction toolchains, and extractable features, and identifies a critical lack of publicly available, universal, interdisciplinary datasets. It proposes guidelines and a schematic workflow for creating a comprehensive dataset that harmonizes APK- and app-store-derived features across domains, supporting reproducibility and cross-disciplinary collaboration. The authors argue that such a dataset would enable robust validation, foster synergy between malware, clone, and similarity research, and support practical risk mitigation and app-recommendation systems. The work emphasizes open data practices and careful documentation, with attention to human factors in security decisions and multi-label app classifications. Overall, the paper lays the groundwork for a publicly accessible resource that could accelerate Android security research and improve user protection through cross-domain insights.
Abstract
This paper reviews work published between 2002 and 2022 in the fields of Android malware, clone, and similarity detection. It examines the data sources, tools, and features used in existing research and identifies the need for a comprehensive, cross-domain dataset to facilitate interdisciplinary collaboration and the exploitation of synergies between different research areas. It also shows that many papers publish neither their dataset nor a description of how it was created, which makes their results difficult to reproduce or compare. The paper highlights the necessity for a dataset that is accessible, well-documented, and suitable for a range of applications. It provides guidelines for such a dataset, along with a schematic method for creating it.
