Permissioned Blockchain-based Framework for Ranking Synthetic Data Generators
Narasimha Raghavan Veeraragavan, Mohammad Hossein Tabatabaei, Severin Elvatun, Vibeke Binz Vallevik, Siri Larønningen, Jan F Nygård
TL;DR
This work presents a permissioned blockchain-based framework, built on Sawtooth, that ranks synthetic data generators for a specific purpose while keeping the selection process transparent, accountable, and auditable under GDPR and AI Act considerations. It introduces a novel ranking algorithm that balances desirable and undesirable properties through hierarchical weights assigned to quality indicators and their metrics, and deploys this logic as smart contracts. The framework is validated through experiments with health-related synthetic data generators and comparisons against baseline ranking solutions, demonstrating accurate rankings and practical blockchain performance. The approach supports compliant, purpose-driven selection of synthetic data generators with an auditable, tamper-evident trail of decisions.
Abstract
Synthetic data generation is increasingly recognized as a crucial solution to data-related challenges such as scarcity, bias, and privacy concerns. As synthetic data proliferates and the variety of available generators grows, the need for a robust evaluation framework for selecting a synthetic data generator becomes more pressing. In this study, we investigate two primary questions: (1) How can we select the most suitable synthetic data generator from a set of options for a specific purpose? (2) How can we make the selection process more transparent, accountable, and auditable? To address these questions, we introduce a novel approach in which the proposed ranking algorithm is implemented as a smart contract within a permissioned blockchain framework called Sawtooth. Through comprehensive experiments and comparisons with state-of-the-art baseline ranking solutions, we show that the framework provides nuanced rankings that consider both desirable and undesirable properties. Furthermore, the framework serves as a valuable tool for selecting the optimal synthetic data generator for a specific need while ensuring compliance with data protection principles.
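
To make the ranking idea concrete, the following is a minimal off-chain sketch of a two-level weighted score, not the authors' algorithm. It assumes that every metric is already normalized to [0, 1], that an undesirable metric contributes 1 minus its value, and that weights are combined linearly. The indicator names, metric names, weights, and generator values are all hypothetical, and the sketch uses no Sawtooth API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    weight: float     # weight of the metric within its quality indicator
    desirable: bool   # True: higher is better; False: lower is better


# Hypothetical hierarchy: indicator -> (indicator weight, list of metrics).
HIERARCHY = {
    "fidelity": (0.5, [Metric("marginal_similarity", 1.0, True)]),
    "privacy": (0.5, [Metric("reidentification_risk", 1.0, False)]),
}


def score(values, hierarchy=HIERARCHY):
    """Combine normalized metric values into one score in [0, 1]."""
    indicator_weight_sum = sum(w for w, _ in hierarchy.values())
    total = 0.0
    for indicator_weight, metrics in hierarchy.values():
        metric_weight_sum = sum(m.weight for m in metrics)
        indicator_score = 0.0
        for m in metrics:
            v = values[m.name]
            # Assumption: an undesirable property is flipped so that a
            # lower raw value yields a higher contribution.
            contribution = v if m.desirable else 1.0 - v
            indicator_score += (m.weight / metric_weight_sum) * contribution
        total += (indicator_weight / indicator_weight_sum) * indicator_score
    return total


def rank(generators, hierarchy=HIERARCHY):
    """Return (generator name, score) pairs, best first."""
    scored = [(name, score(vals, hierarchy)) for name, vals in generators.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical, made-up metric values for two generators.
EXAMPLE = {
    "gen_a": {"marginal_similarity": 0.9, "reidentification_risk": 0.6},
    "gen_b": {"marginal_similarity": 0.8, "reidentification_risk": 0.1},
}
ranking = rank(EXAMPLE)
```

With these made-up values, `gen_b` ranks above `gen_a` even though its fidelity is slightly lower, because its much lower re-identification risk outweighs the difference. Changing the indicator weights in the hypothetical `HIERARCHY` is how a purpose-specific preference would be expressed in this sketch.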
