Transparent NLP: Using RAG and LLM Alignment for Privacy Q&A
Anna Leschanowsky, Zahra Kolagar, Erion Çano, Ivan Habernal, Dara Hallinan, Emanuël A. P. Habets, Birgit Popp
TL;DR
The paper addresses GDPR transparency requirements in NLP by evaluating Retrieval Augmented Generation (RAG) systems extended with alignment modules, specifically Rewindable Auto-regressive Inference (RAIN) and its multidimensional extension MultiRAIN, on a Privacy Q&A dataset (expert_privacy_qa). The experimental framework covers nine systems across three experiments and 21 evaluation metrics, combining LLM-as-a-judge and deterministic measures, and uses principal component analysis (PCA) to explore how the metrics relate. Alignment-enabled systems outperform vanilla RAG on most metrics, though none fully match human answers. PCA reveals complex, sometimes conflicting, relationships between metrics and points to gaps in current measurement approaches. The work provides a foundation for integrating advanced NLP systems into GDPR compliance workflows and outlines directions for improving alignment methods, metric design, and the legal analysis that underpins automated transparency claims.
Abstract
The transparency principle of the General Data Protection Regulation (GDPR) requires data processing information to be clear, precise, and accessible. While language models show promise in this context, their probabilistic nature complicates truthfulness and comprehensibility. This paper examines state-of-the-art Retrieval Augmented Generation (RAG) systems enhanced with alignment techniques to fulfill GDPR obligations. We evaluate RAG systems incorporating an alignment module such as Rewindable Auto-regressive Inference (RAIN) or our proposed multidimensional extension, MultiRAIN, using a Privacy Q&A dataset. Responses are optimized for preciseness and comprehensibility and are assessed through 21 metrics, including deterministic and large language model-based evaluations. Our results show that RAG systems with an alignment module outperform baseline RAG systems on most metrics, though none fully match human answers. Principal component analysis of the results reveals complex interactions between metrics, highlighting the need to refine them. This study provides a foundation for integrating advanced natural language processing systems into legal compliance frameworks.
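
To make the role of principal component analysis concrete, the following is a minimal sketch of how PCA can expose agreeing and conflicting metrics in a score table. It is not the authors' procedure: the function name `pca_on_metrics`, the four placeholder metrics, and the synthetic scores are all assumptions made for illustration, and only NumPy is used.

```python
import numpy as np

def pca_on_metrics(scores, n_components=2):
    """PCA on a (n_responses, n_metrics) score table via SVD.

    Hypothetical helper: returns the component loadings (one row per
    component, one column per metric) and the explained-variance ratios.
    """
    X = np.asarray(scores, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                      # avoid dividing by zero for constant metrics
    Z = (X - X.mean(axis=0)) / std           # standardize so metrics share a scale
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)          # share of variance per component
    return Vt[:n_components], explained[:n_components]

# Synthetic scores (assumption): metrics A and B track one latent quality,
# metric C moves against it, and metric D is unrelated noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
scores = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),    # metric A
    latent + 0.1 * rng.normal(size=(200, 1)),    # metric B, agrees with A
    -latent + 0.1 * rng.normal(size=(200, 1)),   # metric C, conflicts with A and B
    rng.normal(size=(200, 1)),                   # metric D, independent
])

loadings, explained = pca_on_metrics(scores)
print("explained variance ratio:", np.round(explained, 3))
print("first-component loadings:", np.round(loadings[0], 3))
```

In this toy setup, metrics whose loadings on the first component share a sign move together, while opposite signs indicate a trade-off. That is the kind of relationship the abstract describes as complex interactions between metrics.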
