Censoring chemical data to mitigate dual use risk
Quintina L. Campbell, Jonathan Herington, Andrew D. White
TL;DR
The paper addresses dual-use risks in predictive chemistry by proposing data-level mitigation: selective noise is applied to sensitive regions of the data, aiming to preserve openness while reducing misuse. It formally analyzes and empirically tests how perturbing either molecular features (via SMILES replacements) or labels in sensitive regions affects model bias and variance across 1D, multilayer perceptron (MLP), and graph convolutional network (GCN) tasks, including lipophilicity prediction. Key finding: selective feature noise can induce attenuation bias in sensitive regions and decrease predictive accuracy for dangerous compounds, whereas label noise increases variance, and omitting sensitive data fails to prevent extrapolation in deep learning models. The approach offers a model-agnostic path to safer open data sharing, though it requires further refinement to balance protection with accuracy in the non-sensitive region and to handle multiple sensitivity levels.
Abstract
Machine learning models have dual-use potential: they can serve both beneficial and malicious purposes. The development of open-source models in chemistry has specifically surfaced dual-use concerns around toxicological data and chemical warfare agents. We discuss a chain risk framework identifying three misuse pathways and corresponding mitigation strategies: inference-level, model-level, and data-level. At the data level, we introduce a model-agnostic noising method that increases prediction error in designated regions of the data (sensitive regions). Our results show that selective noise induces variance and attenuation bias, whereas simply omitting sensitive data fails to prevent extrapolation. These findings hold for both molecular feature multilayer perceptrons and graph neural networks. Thus, noising molecular structures can enable open sharing of potential dual-use molecular data.
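The attenuation effect described above can be illustrated with a minimal 1D sketch. This is not the authors' implementation: it assumes a linear ground truth, uses Gaussian noise on a scalar feature as a stand-in for the paper's SMILES replacements, and the threshold `x > 1.0`, the noise scale `sigma`, and the helper names `selective_feature_noise` and `ols_slope` are all hypothetical choices made for illustration.

```python
import numpy as np

def selective_feature_noise(x, sensitive_mask, sigma, rng):
    """Return a copy of x with Gaussian noise added only where sensitive_mask is True."""
    x_noisy = x.copy()
    x_noisy[sensitive_mask] += rng.normal(0.0, sigma, size=int(sensitive_mask.sum()))
    return x_noisy

def ols_slope(x, y):
    """Slope of the least-squares line y ~ a*x + b."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

rng = np.random.default_rng(0)
n = 20000
x = rng.uniform(-3.0, 3.0, n)
y = 2.0 * x + rng.normal(0.0, 0.1, n)   # assumed ground truth: slope 2

sensitive = x > 1.0                      # hypothetical sensitive region
x_noisy = selective_feature_noise(x, sensitive, sigma=1.0, rng=rng)

# Fit separately inside and outside the sensitive region.
slope_clean = ols_slope(x[sensitive], y[sensitive])
slope_noisy = ols_slope(x_noisy[sensitive], y[sensitive])
slope_nonsensitive = ols_slope(x_noisy[~sensitive], y[~sensitive])

print(f"sensitive region, clean features: {slope_clean:.2f}")
print(f"sensitive region, noised features: {slope_noisy:.2f}")
print(f"non-sensitive region (untouched):  {slope_nonsensitive:.2f}")
```

Under these assumptions, the slope fitted on noised features in the sensitive region shrinks toward zero by roughly the classical errors-in-variables factor Var(x) / (Var(x) + sigma^2), while the fit on the untouched non-sensitive region stays close to the true slope. Adding the same noise to `y` instead of `x` would leave the expected slope unchanged and only raise the variance of the estimate, which matches the distinction the summary draws between feature noise and label noise.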
