SelectFormer: Private and Practical Data Selection for Transformers
Xu Ouyang, Felix Xiaozhu Lin, Yangfeng Ji
TL;DR
SelectFormer is a practical framework for private data selection for Transformer-based models: candidate data is evaluated over MPC so that both the data and the model stay private. The core idea is to emulate nonlinear Transformer operations with low-dimensional MLP proxies, filter data progressively through multi-phase selection, and run MPC in parallel to hide latency. This brings the selection delay down to tens of hours, with only about 0.20% accuracy loss on the final model. The approach is validated on NLP and CV benchmarks with multiple target models, showing substantial speedups over direct MPC evaluation and competitive accuracy relative to gold-standard selection. The work advances privacy-preserving data purchasing and data-market concepts, enabling scalable, privacy-aware data acquisition for large-scale Transformer training.
Abstract
Critical to a free data market is $\textit{private data selection}$, i.e., the model owner selects and then appraises training data from the data owner before both parties commit to a transaction. To keep the data and model private, this process must evaluate the target model to be trained over Multi-Party Computation (MPC). While prior work suggests that evaluating Transformer-based models over MPC is prohibitively expensive, this paper makes it practical for the purpose of data selection. Our contributions are threefold: (1) a new pipeline for private data selection over MPC; (2) emulating high-dimensional nonlinear operators with low-dimensional MLPs, which are trained on a small sample of the data of interest; (3) scheduling MPC in a parallel, multiphase fashion. We evaluate our method on diverse Transformer models and NLP/CV benchmarks. Compared to directly evaluating the target model over MPC, our method reduces the delay from thousands of hours to tens of hours, while incurring only around 0.20% accuracy degradation from training with the selected data.
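The multiphase idea in contribution (3) can be illustrated with a minimal sketch. It runs entirely in the clear (no MPC) and only shows the control flow of progressive filtering: each phase scores the surviving candidates and keeps the top few for the next phase. The function name `multiphase_select`, the scorer callables, and the per-phase keep counts are all hypothetical and are not taken from the paper. In the setting described above, each scorer would presumably be a proxy model evaluated over MPC, with the cheaper proxies applied first.

```python
import numpy as np


def multiphase_select(pool, scorers, keep_counts):
    """Progressively filter a candidate pool (hypothetical helper).

    pool        : sequence of candidate data points
    scorers     : one scoring callable per phase, higher score = more useful
    keep_counts : how many candidates survive each phase
    Returns the indices (into pool) of the candidates left after the last phase.
    """
    candidates = np.arange(len(pool))
    for scorer, keep in zip(scorers, keep_counts):
        # Score only the candidates that survived earlier phases.
        scores = np.array([scorer(pool[i]) for i in candidates])
        # Stable sort on negated scores: highest score first, ties keep order.
        order = np.argsort(-scores, kind="stable")
        candidates = candidates[order[:keep]]
    return candidates


# Toy usage: a cheap first-phase scorer, then a second scorer on the survivors.
pool = np.arange(100, dtype=float)
selected = multiphase_select(
    pool,
    scorers=[lambda x: x, lambda x: -abs(x - 90.0)],
    keep_counts=[20, 5],
)
```

The point of the design is that the costlier scorer only ever sees the 20 survivors of the first phase rather than all 100 candidates, which is where the delay savings would come from when each evaluation is expensive.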
