Which Code Statements Implement Privacy Behaviors in Android Applications?

Chia-Yi Su, Aakash Bansal, Vijayanta Jain, Sepideh Ghanavati, Sai Teja Peddinti, Collin McMillan

TL;DR

This work tackles the fine-grained mapping between code statements and privacy behaviors in Android apps. It first conducts a human study with 18 programmers to identify which AST statements are most related to privacy labels, finding that expression statements involving function calls are predominant and that the privacy label type has little effect on statement attributes. It then proposes an LLM-based approach to automatically detect privacy-relevant statements by fine-tuning three models on the study data, demonstrating that model predictions can match or exceed human agreement on both relevance and order in many cases, with a smaller domain-specific model (Jam) often performing best. The results provide a practical path for keeping privacy labels aligned with evolving code and offer a valuable dataset (2,426 method-label annotations) and prompts to support developers in reasoning about privacy in mobile software.

Abstract

A "privacy behavior" in software is an action where the software uses personal information for a service or a feature, such as a website using location to provide content relevant to a user. Programmers are required by regulations or application stores to provide privacy notices and labels describing these privacy behaviors. Although many tools and research prototypes have been developed to help programmers generate these notices by analyzing the source code, these approaches are often fairly coarse-grained (i.e., at the level of whole methods or files, rather than at the statement level). But this is not necessarily how privacy behaviors exist in code: they are embedded in specific statements. Current literature does not examine which statements programmers see as most important, how consistent these views are, or how to detect them. In this paper, we conduct an empirical study to examine which statements programmers view as most related to privacy behaviors. We find that expression statements that make function calls are most associated with privacy behaviors, while the type of privacy label has little effect on the attributes of the selected statements. We then propose an approach to automatically detect these privacy-relevant statements by fine-tuning three large language models with the data from the study. We observe that the agreement between our approach and participants is comparable to or higher than the agreement between two participants. Our study and detection approach can help programmers understand which statements in code affect privacy in mobile applications.

Paper Structure

This paper contains 35 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: A motivating example of how "fine-grained" analysis highlights the embedded privacy behavior. The numbers indicate the relevance of each statement to the 'Advertisement' label in the 'purpose' category, in decreasing order.
  • Figure 2: Snapshot of recent research approaches for generating and analyzing privacy notices. $T$ denotes the use of templates. $N$ denotes the use of Neural Networks. Column $I$ denotes approaches for inconsistency detection. $D$ denotes approaches that analyze the developer's descriptions and answers to questionnaires. Column $A$ denotes approaches that analyze API calls. $C$ denotes approaches that use code comprehension.
  • Figure 3: Overview of our study. We discuss our web-survey study and statement categorization in Section \ref{sec:study}. We discuss the results of our empirical analysis in Section \ref{sec:Analysis}. We discuss the features of our automated approach in Section \ref{sec:approach}, and the evaluation of that approach in Section \ref{sec:eval}.
  • Figure 4: A snapshot from our web-survey.
  • Figure 5: Charts showing the normalized distribution of statement categories for: (a) all statements in the methods, (b) all ratings from participants, (c) all statements without the func_call, and (d) all ratings without the func_call.
  • ...and 3 more figures