Trustformer: A Trusted Federated Transformer

Ali Abbasi Tadi, Dima Alhadidi, Luis Rueda

TL;DR

Trustformer introduces a privacy-preserving federated learning approach for Transformers: each client clusters its per-layer weights with k-means and exchanges only the centroids, which are secured with Intel SGX. The method replaces full-weight averaging with centroid averaging, which avoids exposing full model weights and substantially reduces communication overhead. A formal convergence analysis shows that as the number of clusters approaches the number of parameters, the method converges to FedAvg; empirical results on WMT19 Russian-English translation demonstrate competitive translation quality with lower communication costs than DP-based baselines. This work offers a practical path toward secure, scalable federated Transformer training from scratch, with potential extensions to personalized FL and dimensionality reduction of centroid representations.
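
As a rough illustration of centroid averaging and its communication footprint, the sketch below averages per-client centroid arrays on the server and compares the number of values sent against the full layer. It is a minimal sketch, not the paper's implementation: the function names `average_centroids` and `payload_ratio` are hypothetical, each row of a layer's weight matrix is assumed to be one clustered point, and centroid `j` is assumed to refer to the same cluster on every client.

```python
import numpy as np

def average_centroids(client_centroids):
    """Element-wise mean of per-client centroid arrays.

    Assumption: every array has shape (No_c, d) and row j describes the
    same cluster on every client (the alignment is not specified here).
    """
    stacked = np.stack(client_centroids, axis=0)  # (clients, No_c, d)
    return stacked.mean(axis=0)                   # (No_c, d)

def payload_ratio(layer_shape, no_c):
    """Values sent as centroids divided by values in the full layer.

    Assumption: the layer is a (rows, cols) matrix whose rows are the
    clustered points, so the centroids form a (no_c, cols) matrix.
    """
    rows, cols = layer_shape
    return (no_c * cols) / (rows * cols)

# Two clients, two centroids each, in a 2-dimensional weight space.
client_a = np.array([[0.0, 0.0], [2.0, 2.0]])
client_b = np.array([[2.0, 4.0], [4.0, 6.0]])
global_centroids = average_centroids([client_a, client_b])

# A 300 x 2 layer summarised by 3 centroids sends 6 values instead of 600.
ratio = payload_ratio((300, 2), no_c=3)
```

Under these assumptions the server never sees individual weights, only the `No_c` centroid rows per layer, and the payload shrinks by roughly the ratio of clusters to rows.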

Abstract

Transformers, a cornerstone of deep-learning architectures for sequential data, have achieved state-of-the-art results in tasks like Natural Language Processing (NLP). Models such as BERT and GPT-3 exemplify their success and have driven the rise of large language models (LLMs). However, a critical challenge persists: safeguarding the privacy of data used in LLM training. Privacy-preserving techniques like Federated Learning (FL) offer potential solutions, but practical limitations hinder their effectiveness for Transformer training. Two primary issues are (I) the risk of sensitive information leakage due to aggregation methods like FedAvg or FedSGD, and (II) the high communication overhead caused by the large size of Transformer models. This paper introduces a novel FL method that reduces communication overhead while maintaining competitive utility. Our approach avoids sharing full model weights by simulating a global model locally. We apply k-means clustering to each Transformer layer, compute centroids locally, and transmit only these centroids to the server instead of full weights or gradients. To enhance security, we leverage Intel SGX for secure transmission of centroids. Evaluated on a translation task, our method achieves utility comparable to state-of-the-art baselines while significantly reducing communication costs. This provides a more efficient and privacy-preserving FL solution for Transformer models.
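
The client-side step described in the abstract (cluster each layer, keep only the centroids for transmission) can be sketched as follows. This is an illustrative sketch under stated assumptions: it uses a plain NumPy implementation of Lloyd's k-means rather than whatever clustering routine the authors use, treats each row of a layer's weight matrix as one point, and the names `kmeans` and `cluster_layers` are hypothetical.

```python
import numpy as np

def kmeans(points, no_c, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct, randomly chosen points.
    start = rng.choice(len(points), size=no_c, replace=False)
    centroids = points[start].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: (n, no_c).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(no_c):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def cluster_layers(layers, no_c):
    """Cluster each layer's weight matrix row-wise.

    Returns {layer_name: (centroids, labels)}. In this sketch only the
    centroids would leave the client; the labels stay local.
    """
    return {name: kmeans(w, no_c) for name, w in layers.items()}

# Toy "model" with one 300 x 2 layer, as in a single-layer example.
rng = np.random.default_rng(1)
layers = {"dense": rng.normal(size=(300, 2))}
clustered = cluster_layers(layers, no_c=3)
centroids, labels = clustered["dense"]
```

Keeping the cluster labels on the client is a design assumption of this sketch: the labels are what later allow a client to map received centroids back onto its own weights.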

Paper Structure

This paper contains 31 sections, 8 theorems, 26 equations, 6 figures, 5 tables, 5 algorithms.

Key Result

Theorem 1

In a federated learning setting where each client clusters their local model parameters $w_i \in \mathbb{R}^r$ into $No_c$ clusters and updates their model by adjusting parameters based on the differences between global and local centroids, the clients' updated models $w_i^{\text{global}}$ converge

Figures (6)

  • Figure 1: Architecture
  • Figure 2: Training and aggregation
  • Figure 3: Example of simulating the global model using the differences between global and local centroids. For simplicity, we assume there is only one trainable layer with two neurons and 300 inputs. The weights of the neurons are shown as $f_1$ and $f_2$. Since there are 300 inputs, 2 trainable weights, and only one layer, the model is a matrix of 300 rows and 2 columns. a) Clustered local model with $No_c=3$. b) The differences between the features of the global centroids and those of the local centroids are computed as a set ($Diff_1=\{(\Delta f_1, \Delta f_2)_{\hat{c}_1}, (\Delta f_1, \Delta f_2)_{\hat{c}_2}, (\Delta f_1, \Delta f_2)_{\hat{c}_3}\}$). c) Simulation of the global model by moving each data point according to these differences: $(\Delta f_1, \Delta f_2)_{\hat{c}_j}$ is added to each data point in cluster $S^i_{1,j}$.
  • Figure 4: Loss values for three clients with a varying number of clusters. FedAvg is the baseline method, which provides no privacy, whereas DP-FedSAM and Trustformer are privacy-preserving methods.
  • Figure 5: Loss values for three clients with a varying number of clusters. FedAvg is the baseline method, which provides no privacy, whereas DP-FedAvg, DP-BLUR-LUS, and Trustformer are privacy-preserving methods.
  • ...and 1 more figure
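
The simulation step in the Figure 3 caption (shift every point in a cluster by that cluster's centroid difference) can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function name `simulate_global` is hypothetical, and the difference is assumed to be taken as global centroid minus local centroid, which the caption does not state explicitly.

```python
import numpy as np

def simulate_global(weights, labels, local_centroids, global_centroids):
    """Shift each weight row by its cluster's centroid difference.

    weights:          (n, d) local layer, one row per clustered point
    labels:           (n,)   cluster index of each row
    local_centroids:  (No_c, d)
    global_centroids: (No_c, d)
    Assumption: Diff = global - local (sign convention chosen here).
    """
    diff = global_centroids - local_centroids  # one (delta f_1, ..., delta f_d) per cluster
    return weights + diff[labels]              # integer indexing broadcasts per row

# Four points in two clusters; local centroids are the cluster means.
w = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
labels = np.array([0, 0, 1, 1])
local = np.array([[1.0, 0.0], [11.0, 10.0]])
glob = np.array([[2.0, 1.0], [10.0, 9.0]])

simulated = simulate_global(w, labels, local, glob)
```

When the local centroids are exact cluster means, each shifted cluster's mean coincides with the corresponding global centroid, while the relative positions of the points inside a cluster are preserved.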

Theorems & Definitions (17)

  • Theorem 1
  • proof
  • Lemma 1: Parameter Adjustment Formula
  • proof
  • Lemma 2: Average of Updated Parameters Equals FedAvg
  • proof
  • Lemma 3: Bound on Parameter Difference $\delta_{i,l}$
  • proof
  • Lemma 4: Bound on Centroid Difference $\eta_{i,j}$
  • proof
  • ...and 7 more