Quantifying the Generalization Gap: A New Benchmark for Out-of-Distribution Graph-Based Android Malware Classification
Ngoc N. Tran, Anwar Said, Waseem Abbas, Tyler Derr, Xenofon D. Koutsoukos
TL;DR
Graph-based Android malware detectors show high in-distribution accuracy but substantial generalization gaps under distribution shift. The authors introduce MalNet-Tiny-Common (covariate shift) and MalNet-Tiny-Distinct (domain shift) to benchmark robustness, and propose a semantic enrichment framework that augments function-call graphs with metadata and LLM embeddings, coupled with three feature-collation strategies. Empirical results show that semantic features consistently boost robustness and complement adaptation methods, enabling more resilient detection under evolving threats. The work provides publicly released datasets and a scalable pipeline to drive semantics-aware, distribution-aware malware detection research.
Abstract
While graph-based Android malware classifiers achieve over 94% accuracy on standard benchmarks, they exhibit a significant generalization gap under distribution shift, suffering up to 45% performance degradation when encountering unseen malware variants from known families. This work systematically investigates this critical yet overlooked challenge for real-world deployment by introducing a benchmarking suite designed to simulate two prevalent scenarios: MalNet-Tiny-Common for covariate shift, and MalNet-Tiny-Distinct for domain shift. Furthermore, we identify an inherent limitation in existing benchmarks: their inputs are structure-only function call graphs, which fail to capture the latent semantic patterns necessary for robust generalization. To verify this, we construct a semantic enrichment framework that augments the original topology with function-level attributes, including lightweight metadata and LLM-based code embeddings. By providing this expanded feature set, we aim to equip future research with richer behavioral information to facilitate the development of more sophisticated detection techniques. Empirical evaluations confirm the effectiveness of our data-centric methodology: it yields better classification performance under distribution shift than model-based approaches, and consistently enhances robustness further when the two are used in conjunction. We release our precomputed datasets, along with an extensible implementation of our comprehensive pipeline, to lay the groundwork for building resilient malware detection systems for evolving threat environments.
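To make the idea of semantic enrichment concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a call graph given as (caller, callee) pairs and builds one feature vector per function by concatenating structure-only features (in-degree and out-degree) with per-function metadata and a code embedding. The function names, the single metadata flag, the two-dimensional embeddings, and the zero-filling of missing attributes are all hypothetical choices for illustration; plain concatenation is only one possible way to combine the features and is not claimed to be one of the three strategies the paper proposes.

```python
from collections import defaultdict


def structural_features(edges):
    """Return {node: [in_degree, out_degree]} for a directed call graph."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    nodes = set()
    for caller, callee in edges:
        out_deg[caller] += 1
        in_deg[callee] += 1
        nodes.update((caller, callee))
    return {n: [float(in_deg[n]), float(out_deg[n])] for n in sorted(nodes)}


def enrich(edges, metadata, embeddings, meta_dim, embed_dim):
    """Concatenate structural, metadata, and embedding features per node.

    Nodes without metadata or an embedding get zero vectors (an assumption
    made here for illustration only).
    """
    enriched = {}
    for node, base in structural_features(edges).items():
        meta = metadata.get(node, [0.0] * meta_dim)
        emb = embeddings.get(node, [0.0] * embed_dim)
        if len(meta) != meta_dim or len(emb) != embed_dim:
            raise ValueError(f"dimension mismatch for node {node!r}")
        enriched[node] = base + list(meta) + list(emb)
    return enriched


# Toy call graph: main calls parse and send; parse calls send.
edges = [("main", "parse"), ("main", "send"), ("parse", "send")]
# Hypothetical metadata: a single "is entry point" flag.
metadata = {"main": [1.0], "parse": [0.0]}
# Hypothetical 2-dimensional stand-ins for LLM code embeddings.
embeddings = {"main": [0.1, 0.2], "send": [0.3, 0.4]}

features = enrich(edges, metadata, embeddings, meta_dim=1, embed_dim=2)
for name, vec in features.items():
    print(name, vec)
```

In this toy setting a structure-only benchmark would expose just the first two entries of each vector, while the enriched variant gives a downstream classifier the additional function-level attributes as well.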
