Embedding Privacy in Computational Social Science and Artificial Intelligence Research
Keenan Jones, Fatima Zahrah, Jason R. C. Nurse
TL;DR
This paper addresses privacy challenges in computational social science (CSS) and artificial intelligence (AI) that arise from the use of large-scale human data and generative models. It advocates embedding privacy by design through frameworks such as the Data Protection Impact Assessment (DPIA) and alignment with applicable regulation, while carefully managing data collection, storage, and dissemination to prevent reidentification and downstream harms such as leakage of personal data from trained models. The authors provide structured guidance across research design, data handling, analysis, and publication to help researchers, editors, and policymakers implement privacy-preserving practices. The practical aim is to establish privacy as a core research constraint, reducing harm and increasing trust in CSS and AI studies.
Abstract
Privacy is a human right. It ensures that individuals are free to engage in discussions, participate in groups, and form relationships online or offline without fear of their data being inappropriately harvested, analyzed, or otherwise used to harm them. Preserving privacy has emerged as a critical factor in research, particularly in the computational social science (CSS), artificial intelligence (AI) and data science domains, given their reliance on individuals' data for novel insights. The increasing use of advanced computational models stands to exacerbate privacy concerns because, if inappropriately used, they can quickly infringe privacy rights and lead to adverse effects for individuals (especially vulnerable groups) and society. We have already witnessed a host of privacy issues emerge with the advent of large language models (LLMs), such as ChatGPT, which further demonstrate the importance of embedding privacy from the start. This article contributes to the field by discussing the role of privacy and the issues that researchers working in CSS, AI, data science and related domains are likely to face. It then presents several key considerations for researchers to ensure participant privacy is best preserved in their research design, data collection and use, analysis, and dissemination of research results.
