On Benchmarking Code LLMs for Android Malware Analysis

Yiling He, Hongyu She, Xingzhi Qian, Xinran Zheng, Zhuo Chen, Zhan Qin, Lorenzo Cavallaro

TL;DR

The paper introduces CAMA, a benchmarking framework for systematically evaluating Code LLMs on Android malware analysis. It defines a structured output format (malicious function summaries, refined names, and maliciousness scores) and three domain-specific metrics (consistency, fidelity, semantic relevance) to assess stability and effectiveness. Using a dataset of 118 Android malware samples comprising over 7.5 million functions, the authors compare four open-source Code LLMs and show that instruction-tuned GPT-style models (e.g., StarChat) outperform Seq2Seq baselines, while function renaming can improve fidelity and consistency at a potential cost to semantic clarity. The work highlights both the promise and current limitations of Code LLMs for fine-grained malware analysis and provides a foundation for targeted model adaptation and richer ground-truth data.

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities in various code intelligence tasks. However, their effectiveness for Android malware analysis remains underexplored. Decompiled Android malware code presents unique challenges for analysis, due to the malicious logic being buried within a large number of functions and the frequent lack of meaningful function names. This paper presents CAMA, a benchmarking framework designed to systematically evaluate the effectiveness of Code LLMs in Android malware analysis. CAMA specifies structured model outputs to support key malware analysis tasks, including malicious function identification and malware purpose summarization. Built on these, it integrates three domain-specific evaluation metrics (consistency, fidelity, and semantic relevance), enabling rigorous stability and effectiveness assessment and cross-model comparison. We construct a benchmark dataset of 118 Android malware samples from 13 families collected in recent years, encompassing over 7.5 million distinct functions, and use CAMA to evaluate four popular open-source Code LLMs. Our experiments provide insights into how Code LLMs interpret decompiled code and quantify the sensitivity to function renaming, highlighting both their potential and current limitations in malware analysis.

Paper Structure

This paper contains 19 sections, 9 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Evaluation pipeline of CAMA.
  • Figure 2: Demonstration of the limited capability of CodeT5 in generating a meaningful output when additional requirements are specified.
  • Figure 3: Maliciousness score distributions before and after function renaming. For both models, refined function names lead to more scores concentrated in the middle range.