GDPR-Bench-Android: A Benchmark for Evaluating Automated GDPR Compliance Detection in Android
Huaijin Ran, Haoyi Zhang, Xunzhu Tang
TL;DR
GDPR-Bench-Android tackles the challenge of automated GDPR violation detection in Android source code by introducing a large, legally grounded dataset of 1951 annotated violations across 15 repositories and a novel formal baseline, Formal-AST. The paper defines two complementary tasks—multi-granularity violation localization and snippet-level multi-label classification—and evaluates 11 methods spanning symbolic, neural, retrieval-based, and agentic paradigms. Key findings show that no single approach excels across all tasks: agentic methods lead in long-context file-level analysis, LLMs dominate line-level detection, and RAG improves precision in multi-label classification, while Formal-AST provides a transparent baseline but struggles with semantic nuance. The work demonstrates the value of a multi-paradigm benchmark for diagnosing strengths and limitations of automated GDPR-compliance tools, and releases open data, prompts, and scripts to foster further research in regulation-aware software engineering.
Abstract
Automating the detection of EU General Data Protection Regulation (GDPR) violations in source code is a critical but underexplored challenge. We introduce \textbf{GDPR-Bench-Android}, the first comprehensive benchmark for evaluating diverse automated methods for GDPR compliance detection in Android applications. It contains \textbf{1951} manually annotated violation instances from \textbf{15} open-source repositories, covering 23 GDPR articles at file-, module-, and line-level granularities. To enable a multi-paradigm evaluation, we contribute \textbf{Formal-AST}, a novel, source-code-native formal method that serves as a deterministic baseline. We define two tasks: (1) \emph{multi-granularity violation localization}, evaluated via Accuracy@\textit{k}; and (2) \emph{snippet-level multi-label classification}, assessed by macro-F1 and other classification metrics. We benchmark 11 methods: eight state-of-the-art LLMs, our Formal-AST analyzer, a retrieval-augmented (RAG) method, and an agentic (ReAct) method. Our findings reveal that no single paradigm excels across all tasks. For Task 1, the ReAct agent achieves the highest file-level Accuracy@1 (17.38%), while the Qwen2.5-72B LLM leads at the line level (61.60%), in stark contrast to the Formal-AST method's 1.86%. For the difficult multi-label Task 2, the Claude-Sonnet-4.5 LLM achieves the best macro-F1 (5.75%), while the RAG method yields the highest macro-precision (7.10%). These results highlight the task-dependent strengths of different automated approaches and underscore the value of our benchmark in diagnosing their capabilities. All resources are available at: https://github.com/Haoyi-Zhang/GDPR-Bench-Android.
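The two metric families named in the abstract can be illustrated with a minimal sketch. It is not the benchmark's evaluation script: the function names, the input layout (ranked lists of GDPR article numbers for Task 1, sets of article numbers for Task 2), the "at least one gold article in the top k" hit rule, and the zero-division convention are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set


def accuracy_at_k(ranked: Sequence[Sequence[int]],
                  gold: Sequence[Set[int]],
                  k: int) -> float:
    """Share of instances whose top-k predictions hit at least one gold article.

    Assumption: a prediction counts as correct if any of its first k
    article numbers appears in the gold set for that instance.
    """
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(
        1 for preds, truth in zip(ranked, gold)
        if any(p in truth for p in preds[:k])
    )
    return hits / len(gold)


def macro_scores(predicted: Sequence[Set[int]],
                 gold: Sequence[Set[int]],
                 labels: Sequence[int]) -> Dict[str, float]:
    """Macro-averaged precision, recall, and F1 over a fixed label set.

    Each label (GDPR article) is scored as its own binary problem and the
    per-label scores are averaged with equal weight. Assumption: a label
    with an empty denominator contributes 0.0 to the average.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")

    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for label in labels:
        tp = fp = fn = 0
        for pred, truth in zip(predicted, gold):
            if label in pred and label in truth:
                tp += 1
            elif label in pred:
                fp += 1
            elif label in truth:
                fn += 1
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)

    n = len(labels)
    return {
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }
```

Because macro averaging weights every article equally, articles that are rarely predicted correctly pull the average down sharply, which is one plausible reason the reported Task 2 scores are in the single digits.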
