Assessing the Effectiveness of LLMs in Android Application Vulnerability Analysis

Vasileios Kouliaridis, Georgios Karopoulos, Georgios Kambourakis

TL;DR

This work tackles the problem of understanding how well large language models can identify and suggest fixes for Android code vulnerabilities aligned with the OWASP Mobile Top 10. By benchmarking nine LLMs on Vulcorpus, a curated dataset of 100 vulnerable Java samples, and comparing them against two static analysis tools, the authors reveal a mixed landscape: GPT-4 and Code Llama perform best overall, yet performance is highly vulnerability-dependent and can be significantly improved through retrieval-augmented generation (RAG). In particular, the study demonstrates that RAG can substantially enhance detection and remediation for Code Llama, highlighting practical implications for secure mobile development. Despite promising results, the authors emphasize the need for larger, more diverse datasets and targeted model improvements to achieve robust, production-ready vulnerability analysis using LLMs.

Abstract

The increasing frequency of attacks on Android applications, coupled with the recent popularity of large language models (LLMs), necessitates a comprehensive understanding of the capabilities of the latter in identifying potential vulnerabilities, which is key to mitigating the overall risk. To this end, the work at hand compares the ability of nine state-of-the-art LLMs to detect Android code vulnerabilities listed in the latest Open Worldwide Application Security Project (OWASP) Mobile Top 10. Each LLM was evaluated against an open dataset of over 100 vulnerable code samples, including obfuscated ones, assessing each model's ability to identify key vulnerabilities. Our analysis reveals the strengths and weaknesses of each LLM, identifying important factors that contribute to their performance. Additionally, we offer insights into context augmentation with retrieval-augmented generation (RAG) for detecting Android code vulnerabilities, which in turn may propel secure application development. Finally, while the reported findings regarding code vulnerability analysis show promise, they also reveal significant discrepancies among the different LLMs.
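To make the RAG idea concrete, the sketch below shows one plausible shape of context augmentation for vulnerability prompts: retrieve the most relevant OWASP Mobile Top 10 category for a code sample, then prepend it to the analysis prompt. This is not the authors' pipeline; the category texts, the keyword-overlap retriever, and all names here are simplified assumptions for illustration.

```python
# Hedged sketch of RAG-style context augmentation for LLM vulnerability analysis.
# NOT the paper's implementation: categories and retrieval are toy assumptions.

# Abbreviated, hypothetical keyword summaries of a few OWASP Mobile Top 10 items.
OWASP_MOBILE_TOP10 = {
    "M1: Improper Credential Usage": "hardcoded credentials password api key secret",
    "M5: Insecure Communication": "http cleartext traffic ssl tls certificate",
    "M9: Insecure Data Storage": "sharedpreferences external storage world readable file",
}

def retrieve(code: str, k: int = 1):
    """Rank OWASP categories by naive keyword overlap with the code sample."""
    tokens = set(code.lower().split())
    scored = sorted(
        OWASP_MOBILE_TOP10.items(),
        key=lambda item: -len(tokens & set(item[1].split())),
    )
    return scored[:k]

def build_prompt(code: str) -> str:
    """Assemble the augmented prompt: retrieved context plus the code under review."""
    context = "\n".join(f"{name}: {desc}" for name, desc in retrieve(code))
    return (
        "You are an Android security auditor.\n"
        f"Relevant OWASP Mobile Top 10 context:\n{context}\n"
        "Identify any vulnerability in this Java snippet and suggest a fix:\n"
        f"{code}"
    )

sample = 'String password = "admin123"; // hardcoded secret'
print(build_prompt(sample))
```

In a real pipeline the keyword retriever would be replaced by embedding search over OWASP documentation, and the assembled prompt would be sent to the LLM under evaluation.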

Paper Structure

This paper contains 9 sections and 5 tables.