Federated Learning in Chemical Engineering: A Tutorial on a Framework for Privacy-Preserving Collaboration Across Distributed Data Sources

Siddhant Dutta, Iago Leal de Freitas, Pedro Maciel Xavier, Claudio Miceli de Farias, David Esteban Bernal Neira

TL;DR

This paper surveys Federated Learning (FL) in chemical engineering, presenting both theoretical foundations and a concrete, hands-on tutorial. It compares FL with centralized learning, addresses non-IID data and robust aggregation methods, and demonstrates FL in pill defect classification, multimodal DNA+MRI classification, and HIV drug-discovery tasks using Flower and TensorFlow Federated. Across these case studies, FL often matches centralized performance, with notable gains in generalization for non-IID, distributed data and clear privacy-preserving benefits; however, multimodal and complex datasets reveal remaining challenges in achieving uniform gains across all modalities. The work highlights practical deployment considerations, scalability improvements, and future directions (encryption, blockchain, ensembles, TinyML, and quantum variants) to advance privacy-preserving collaborative modeling in industrial chemical engineering. The findings support FL as a viable path toward privacy-aware collaboration across distributed chemical data sources, with significant implications for predictive maintenance, drug discovery, and materials research.

Abstract

Federated Learning (FL) is a decentralized machine learning approach that has gained attention for its potential to enable collaborative model training across clients while protecting data privacy, making it an attractive solution for the chemical industry. This work aims to provide the chemical engineering community with an accessible introduction to the discipline. Supported by a hands-on tutorial and a comprehensive collection of examples, it explores the application of FL in tasks such as manufacturing optimization, multimodal data integration, and drug discovery while addressing the unique challenges of protecting proprietary information and managing distributed datasets. The tutorial was built using key frameworks such as $\texttt{Flower}$ and $\texttt{TensorFlow Federated}$ and was designed to give chemical engineers the right tools to adopt FL for their specific needs. We compare the performance of FL against centralized learning across three different datasets relevant to chemical engineering applications, demonstrating that FL often maintains or improves classification performance, particularly for complex and heterogeneous data. We conclude with an outlook on the open challenges in federated learning and the current approaches designed to address them.

Paper Structure

This paper contains 24 sections, 7 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Each client trains its own model on local data. After each training round, the clients transmit their local weights $\theta_k^{(t)}$ to a central server for aggregation. The new global model $\theta^{(t+1)}$ is then distributed to all clients, which adopt it as their current model and proceed with another training round. This way, the clients have access to global information but never directly observe each other's datasets [img_factory1, img_factory2].
  • Figure 2: Example pills taken from the dataset [Bergmann_2019, Bergmann_2021]. The pills above are defective due to scratches, color changes, and faulty imprints, respectively. Notice that for a lay observer, it may be difficult to distinguish between faulty and good pills.
  • Figure 3: Confusion matrix for the Pill dataset trained with Flower.
  • Figure 4: Receiver operating characteristic curve (ROC) for the Pill dataset.
  • Figure 5: The architecture for the DNA+MRI MMoE network starts by feeding the input to separate specialized networks, followed by both a gating network and a combination of their weights, and, finally, a shared layer before splitting the outputs into different modalities. Notice that while both inputs and outputs are separate, the hidden layers are shared between DNA and MRI.
  • ...and 7 more figures
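
The training round described in the Figure 1 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the caption does not say how the server aggregates the local weights, so a sample-size-weighted average (FedAvg-style) is assumed here, and the local task (a one-dimensional linear model fitted by gradient descent) and the names `local_update`, `aggregate`, and `federated_round` are hypothetical. It does not use the Flower or TensorFlow Federated APIs.

```python
from typing import List, Sequence, Tuple

Weights = List[float]               # flattened parameter vector [w, b]
Dataset = Sequence[Tuple[float, float]]  # (x, y) pairs held privately by one client


def local_update(theta: Weights, data: Dataset, lr: float = 0.1) -> Weights:
    """One local pass of per-sample gradient descent on y ~ w*x + b.

    Hypothetical local task; the client never shares `data`, only the result.
    """
    w, b = theta
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return [w, b]


def aggregate(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Server step: average client weights, weighted by local sample count.

    Assumption: FedAvg-style weighting; the source caption only says "aggregation".
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(n * w[i] for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def federated_round(theta_global: Weights, client_datasets: List[Dataset],
                    lr: float = 0.1) -> Weights:
    """One round: broadcast theta^(t), train locally, aggregate into theta^(t+1)."""
    local = [local_update(list(theta_global), d, lr) for d in client_datasets]
    sizes = [len(d) for d in client_datasets]
    return aggregate(local, sizes)


if __name__ == "__main__":
    # Two clients whose private data both follow y = 2x + 1 (made-up numbers).
    clients = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (-1.0, -1.0)]]
    theta = [0.0, 0.0]
    for _ in range(200):
        theta = federated_round(theta, clients)
    print(theta)
```

Only the parameter vectors cross the client boundary in `federated_round`; the raw `(x, y)` pairs stay inside `local_update`, which is the property the caption emphasizes.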