SelectFormer: Private and Practical Data Selection for Transformers

Xu Ouyang, Felix Xiaozhu Lin, Yangfeng Ji

TL;DR

SelectFormer is a practical framework for private data selection for Transformer-based models: candidate data is evaluated over MPC, so neither the data nor the model is revealed. The core idea is to replace nonlinear Transformer operations with low-dimensional MLP proxies, to filter the data progressively through multi-phase selection, and to run MPC in parallel to hide latency. Together these bring the selection delay down to tens of hours with only about 0.20% accuracy loss on the final model. The approach is validated on NLP and CV benchmarks with multiple target models, showing substantial speedups over direct MPC evaluation and accuracy competitive with gold-standard selection. The work supports privacy-preserving data purchasing in data markets, enabling scalable, privacy-aware data acquisition for large-scale Transformer training.

Abstract

Critical to a free data market is $\textit{private data selection}$, i.e. the model owner selects and then appraises training data from the data owner before both parties commit to a transaction. To keep the data and model private, this process shall evaluate the target model to be trained over Multi-Party Computation (MPC). While prior work suggests that evaluating Transformer-based models over MPC is prohibitively expensive, this paper makes it practical for the purpose of data selection. Our contributions are three: (1) a new pipeline for private data selection over MPC; (2) emulating high-dimensional nonlinear operators with low-dimension MLPs, which are trained on a small sample of the data of interest; (3) scheduling MPC in a parallel, multiphase fashion. We evaluate our method on diverse Transformer models and NLP/CV benchmarks. Compared to directly evaluating the target model over MPC, our method reduces the delay from thousands of hours to tens of hours, while only seeing around 0.20% accuracy degradation from training with the selected data.
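The multi-phase idea in contribution (3) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `multi_phase_select`, the parameters `keep_fractions` and `budget`, and the toy scoring functions are hypothetical, and the abstract does not say how examples are scored. In the real system each phase would presumably evaluate a proxy model over MPC; here the scorers are plain Python callables so that only the filtering logic is shown.

```python
def multi_phase_select(candidates, scorers, keep_fractions, budget):
    """Filter candidates through successive phases (illustrative only).

    Each phase scores the current survivors with its own scorer (a stand-in
    for a proxy model) and keeps the top fraction, never dropping below the
    final budget. Earlier scorers are assumed cheaper and coarser.
    """
    survivors = list(candidates)
    for score, fraction in zip(scorers, keep_fractions):
        keep = max(budget, int(len(survivors) * fraction))
        # Highest score first; only the top `keep` survive to the next phase.
        survivors = sorted(survivors, key=score, reverse=True)[:keep]
    return survivors[:budget]


# Toy data: each candidate is an integer standing in for its "usefulness".
candidates = list(range(100))

# Hypothetical proxies: a coarse one that only sees the tens digit,
# and a finer one that sees the exact value.
coarse = lambda x: x // 10
fine = lambda x: x

selected = multi_phase_select(
    candidates,
    scorers=[coarse, fine],
    keep_fractions=[0.3, 0.5],
    budget=5,
)
print(selected)
```

In this toy run the coarse phase discards 70 of the 100 candidates, so the finer scorer only has to rank the remaining 30. That is the point of the design: the expensive evaluation is spent only on data that survived a cheaper filter.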

Paper Structure (49 sections, 8 figures, 8 tables)

This paper contains 49 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Three stages of our data selection workflow in chronological order.
  • Figure 2: Transformers over MPC incur high communication and computation overhead. Shown: one forward pass of one layer (12 heads) over a batch of 5 (the maximum allowable on our GPU). Hardware: Quadro RTX 4000. MPC framework: CrypTen (Knott et al., 2021).
  • Figure 3: Our high-level workflow. Left: proxy model generation. Right: multi-phase selection. $M_g$ is the base model for the proxy models $\hat{M}_{1..N}$. More details in \ref{sec:design:model}.
  • Figure 4: Training MLPs to substitute for the nonlinear operations in Transformer models.
  • Figure 5: Across a variety of budgets, Ours consistently outperforms Random and is comparable with Oracle (gold accuracy). Target model: DistilBERT.
  • ...and 3 more figures