Federated Learning in Chemical Engineering: A Tutorial on a Framework for Privacy-Preserving Collaboration Across Distributed Data Sources
Siddhant Dutta, Iago Leal de Freitas, Pedro Maciel Xavier, Claudio Miceli de Farias, David Esteban Bernal Neira
TL;DR
This paper surveys Federated Learning (FL) in chemical engineering, presenting both theoretical foundations and a concrete, hands-on tutorial. It compares FL with centralized learning, addresses non-IID (not independent and identically distributed) data and robust aggregation methods, and demonstrates FL on pill defect classification, multimodal DNA+MRI classification, and HIV drug-discovery tasks using Flower and TensorFlow Federated. Across these case studies, FL often matches centralized performance, with notable gains in generalization for non-IID, distributed data and clear privacy-preserving benefits; however, the multimodal and more complex datasets show that uniform gains across all modalities remain a challenge. The work highlights practical deployment considerations, scalability improvements, and future directions (encryption, blockchain, ensembles, TinyML, and quantum variants) to advance privacy-preserving collaborative modeling in industrial chemical engineering. The findings support FL as a viable path toward privacy-aware collaboration across distributed chemical data sources, with significant implications for predictive maintenance, drug discovery, and materials research.
Abstract
Federated Learning (FL) is a decentralized machine learning approach that has gained attention for its potential to enable collaborative model training across clients while protecting data privacy, making it an attractive solution for the chemical industry. This work aims to provide the chemical engineering community with an accessible introduction to the discipline. Supported by a hands-on tutorial and a comprehensive collection of examples, it explores the application of FL in tasks such as manufacturing optimization, multimodal data integration, and drug discovery while addressing the unique challenges of protecting proprietary information and managing distributed datasets. The tutorial was built using key frameworks such as $\texttt{Flower}$ and $\texttt{TensorFlow Federated}$ and was designed to provide chemical engineers with the right tools to adopt FL for their specific needs. We compare the performance of FL against centralized learning across three different datasets relevant to chemical engineering applications, demonstrating that FL often maintains or improves classification performance, particularly for complex and heterogeneous data. We conclude with an outlook on the open challenges in federated learning and on current approaches designed to address them.
