Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation
Marjia Siddik, Harshvardhan J. Pandit
TL;DR
The paper addresses bias, data incompleteness, and inaccuracies in healthcare AI datasets by proposing a Healthcare AI Datasheet, an extended, machine-readable documentation framework tailored to healthcare. It builds on the Datasheets for Datasets foundation, adding expanded bias categories, explicit risk assessments, and regulatory information aligned with GDPR and the EU AI Act, and demonstrates its fit within the Irish healthcare context. The authors describe an iterative methodology that yields 55 fields across 10 sections, implemented as a JSON schema with plans for DCAT, ODRL, and DPV integration. They show how such documentation supports transparency, accountability, and compliant data reuse, with the potential to automate risk assessments and to prepare for EU data space initiatives such as the European Health Data Space (EHDS). Limitations include domain specificity and reliance on fields being filled in accurately; future work focuses on real-world testing and broader applicability.
Abstract
The use of AI in healthcare has the potential to improve patient care, optimize clinical workflows, and enhance decision-making. However, bias, data incompleteness, and inaccuracies in training datasets can lead to unfair outcomes and amplify existing disparities. This research investigates the current state of dataset documentation practices, focusing on their ability to address these challenges and support ethical AI development. We identify shortcomings in existing documentation methods that limit the recognition and mitigation of bias, incompleteness, and other issues in datasets. To address these gaps, we propose the 'Healthcare AI Datasheet', a dataset documentation framework that promotes transparency and ensures alignment with regulatory requirements. Additionally, we demonstrate how it can be expressed in a machine-readable format, facilitating its integration with datasets and enabling automated risk assessments. The findings emphasise the importance of dataset documentation in fostering responsible AI development.
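As a rough illustration of how a machine-readable datasheet could feed an automated check, the sketch below parses a small JSON fragment and lists required fields that are missing or empty. The section names, field names, and the `find_gaps` helper are hypothetical assumptions made for this example; they are not the 55 fields or the schema defined by the authors.

```python
import json

# Hypothetical datasheet fragment. Section and field names are illustrative
# assumptions, not the fields defined in the Healthcare AI Datasheet.
DATASHEET_JSON = """
{
  "motivation": {"purpose": "Train a readmission-risk model"},
  "composition": {"population": "Adult inpatients", "known_biases": []},
  "regulatory": {"gdpr_legal_basis": "", "ai_act_risk_category": "high"}
}
"""

# Hypothetical set of fields that this check treats as mandatory.
REQUIRED_FIELDS = {
    "motivation": ["purpose"],
    "composition": ["population", "known_biases"],
    "regulatory": ["gdpr_legal_basis", "ai_act_risk_category"],
}


def find_gaps(datasheet: dict) -> list[str]:
    """Return dotted paths of required fields that are missing or empty."""
    gaps = []
    for section, fields in REQUIRED_FIELDS.items():
        content = datasheet.get(section, {})
        for field in fields:
            value = content.get(field)
            # Missing keys, empty strings, and empty lists all count as gaps.
            if value is None or value == "" or value == []:
                gaps.append(f"{section}.{field}")
    return gaps


datasheet = json.loads(DATASHEET_JSON)
gaps = find_gaps(datasheet)
print(gaps)
```

A check of this kind only flags undocumented fields for human review. It cannot verify that a submitted value is accurate, which matches the limitation noted in the TL;DR.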
