GDPR-Bench-Android: A Benchmark for Evaluating Automated GDPR Compliance Detection in Android

Huaijin Ran, Haoyi Zhang, Xunzhu Tang

TL;DR

GDPR-Bench-Android tackles the challenge of automated GDPR violation detection in Android source code by introducing a large, legally grounded dataset of 1951 annotated violations across 15 repositories and a novel formal baseline, Formal-AST. The paper defines two complementary tasks—multi-granularity violation localization and snippet-level multi-label classification—and evaluates 11 methods spanning symbolic, neural, retrieval-based, and agentic paradigms. Key findings show that no single approach excels across all tasks: agentic methods lead in long-context file-level analysis, LLMs dominate line-level detection, and RAG improves precision in multi-label classification, while Formal-AST provides a transparent baseline but struggles with semantic nuance. The work demonstrates the value of a multi-paradigm benchmark for diagnosing strengths and limitations of automated GDPR-compliance tools, and releases open data, prompts, and scripts to foster further research in regulation-aware software engineering.

Abstract

Automating the detection of EU General Data Protection Regulation (GDPR) violations in source code is a critical but underexplored challenge. We introduce GDPR-Bench-Android, the first comprehensive benchmark for evaluating diverse automated methods for GDPR compliance detection in Android applications. It contains 1951 manually annotated violation instances from 15 open-source repositories, covering 23 GDPR articles at file-, module-, and line-level granularities. To enable a multi-paradigm evaluation, we contribute Formal-AST, a novel, source-code-native formal method that serves as a deterministic baseline. We define two tasks: (1) multi-granularity violation localization, evaluated via Accuracy@k; and (2) snippet-level multi-label classification, assessed by macro-F1 and other classification metrics. We benchmark 11 methods, including eight state-of-the-art LLMs, our Formal-AST analyzer, a retrieval-augmented (RAG) method, and an agentic (ReAct) method. Our findings reveal that no single paradigm excels across all tasks. For Task 1, the ReAct agent achieves the highest file-level Accuracy@1 (17.38%), while the Qwen2.5-72B LLM leads at the line level (61.60%), in stark contrast to the Formal-AST method's 1.86%. For the difficult multi-label Task 2, the Claude-Sonnet-4.5 LLM achieves the best Macro-F1 (5.75%), while the RAG method yields the highest Macro-Precision (7.10%). These results highlight the task-dependent strengths of different automated approaches and underscore the value of our benchmark in diagnosing their capabilities. All resources are available at: https://github.com/Haoyi-Zhang/GDPR-Bench-Android.
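
The abstract names Accuracy@k and macro-averaged metrics without defining them at this point. The Python sketch below shows one common reading of both, and it is not taken from the benchmark's released scripts. It assumes that each instance has a ranked list of predicted GDPR article numbers, that a hit for Accuracy@k means at least one of the top k predictions appears in the annotated set, and that a label with no predictions or no positives scores 0. The function names `accuracy_at_k` and `macro_precision_f1` are hypothetical.

```python
from typing import Sequence, Set, Tuple


def accuracy_at_k(ranked_preds: Sequence[Sequence[int]],
                  gold: Sequence[Set[int]], k: int) -> float:
    """Share of instances whose top-k predicted articles contain a gold article.

    Assumption: one hit among the top k is enough to count the instance.
    """
    if not gold:
        return 0.0
    hits = sum(1 for preds, truth in zip(ranked_preds, gold)
               if any(p in truth for p in preds[:k]))
    return hits / len(gold)


def macro_precision_f1(preds: Sequence[Set[int]], gold: Sequence[Set[int]],
                       labels: Sequence[int]) -> Tuple[float, float]:
    """Macro-averaged precision and F1 for multi-label predictions.

    Each label (GDPR article) is scored on its own, then the per-label
    scores are averaged with equal weight, so rare articles count as
    much as frequent ones.
    """
    if not labels:
        return 0.0, 0.0
    precisions, f1s = [], []
    for label in labels:
        tp = sum(1 for p, g in zip(preds, gold) if label in p and label in g)
        fp = sum(1 for p, g in zip(preds, gold) if label in p and label not in g)
        fn = sum(1 for p, g in zip(preds, gold) if label not in p and label in g)
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumed zero-division rule
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        precisions.append(precision)
    return sum(precisions) / len(labels), sum(f1s) / len(labels)


if __name__ == "__main__":
    # Toy data: article numbers only, not drawn from the benchmark.
    ranked = [[6, 5], [32, 25], [5, 6]]
    truth = [{5}, {25}, {7}]
    print(accuracy_at_k(ranked, truth, 1), accuracy_at_k(ranked, truth, 2))
    print(macro_precision_f1([{5, 6}, {25}, {5}], [{5}, {25}, {6}], [5, 6, 25]))
```

Because macro averaging weights all 23 articles equally, a method that only handles the frequent articles well can still post a low Macro-F1, which is consistent with the single-digit Task 2 scores reported above.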

Paper Structure

This paper contains 86 sections, 3 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: A High-Level Overview of the GDPR-Bench-Android Framework.
  • Figure 2: A conceptual overview of the problem motivation, illustrating the gap between the real-world need for GDPR compliance and the limitations of existing automated methods, which motivates the creation of our benchmark.
  • Figure 3: Updated Dataset Construction Pipeline.
  • Figure 4: GDPR violation distribution across the 15 applications in the expanded dataset. The visualizations show a heavy-tailed distribution, with AhMyth-Android-RAT contributing the majority of instances (1044).
  • Figure 5: Distribution of violations across all 23 GDPR articles. Articles 6 (442), 5 (430), 25 (311), and 32 (254) are the most frequent. This concentration highlights key compliance failure patterns.
  • ...and 4 more figures