Distributed Generalized Linear Models: A Privacy-Preserving Approach

Daniel Tinoco, Raquel Menezes, Carlos Baquero

TL;DR

This work addresses privacy-aware model fitting in distributed and streaming environments. It develops a QR-based, privacy-preserving framework for linear regression that supports incremental updates and distributed computation, then extends it to generalized linear models (GLMs) by casting iteratively reweighted least squares (IRLS) as a sequence of weighted least-squares problems solved in transformed coordinates. In experiments on simulated and real data, the distributed methods produce estimates that are practically indistinguishable from those of centralized implementations while reducing the amount of data shared. The approach offers a practical, computationally efficient alternative to cryptographic privacy techniques for federated and streaming scenarios under a semi-honest threat model. Coefficients are near-identical and MAE differences are negligible for both linear models (LM) and GLMs, on synthetic data and on the real datasets (Diamonds, Credit Cards).
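
The QR-based idea can be illustrated with a minimal sketch. It assumes one common way of combining QR factors (each node shares only its triangular factor and the projected response, and an aggregator stacks and re-factorizes them), which may differ in detail from the authors' protocol. The function names `local_summary` and `combine` and the simulated data are hypothetical.

```python
import numpy as np

def local_summary(X, y):
    """Run on a node: only R (p x p) and Q^T y (length p) leave the node."""
    Q, R = np.linalg.qr(X)              # reduced QR of the local design matrix
    return R, Q.T @ y

def combine(summaries):
    """Run by the aggregator: stack the local factors and factorize again."""
    R_stack = np.vstack([R for R, _ in summaries])
    z_stack = np.concatenate([z for _, z in summaries])
    Q, R = np.linalg.qr(R_stack)
    return R, Q.T @ z_stack

# Simulated data (hypothetical), split row-wise across three nodes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=300)
chunks = zip(np.array_split(X, 3), np.array_split(y, 3))

R, z = combine([local_summary(Xi, yi) for Xi, yi in chunks])
beta_distributed = np.linalg.solve(R, z)            # solves R beta = Q^T y
beta_central = np.linalg.lstsq(X, y, rcond=None)[0]
```

The same `combine` call also covers the streaming case in this sketch: a stored pair `(R, z)` can be merged with the summary of a newly arrived batch without revisiting earlier rows.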

Abstract

This paper presents a novel approach to classical linear regression, enabling model computation from data streams or in a distributed setting while preserving data privacy in federated environments. We extend this framework to generalized linear models (GLMs), ensuring scalability and adaptability to diverse data distributions while maintaining privacy-preserving properties. To assess the effectiveness of our approach, we conduct numerical studies on both simulated and real datasets, comparing our method with conventional maximum likelihood estimation for GLMs using iteratively reweighted least squares. Our results demonstrate the advantages of the proposed method in distributed and federated settings.
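
To make the GLM extension concrete, the following is a minimal sketch of IRLS for logistic regression in which every iteration is a weighted least-squares problem solved from per-node QR summaries. It assumes the current coefficient vector is broadcast to the nodes at each iteration and that local factors are combined by stacking; both are illustrative assumptions rather than the paper's exact algorithm, and `node_step`, `fit_logistic`, and the simulated data are hypothetical.

```python
import numpy as np

def node_step(X, y, beta):
    """Run on a node: QR summary of the local weighted problem (logit link)."""
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))
    w = mu * (1.0 - mu)                 # IRLS weights
    z = eta + (y - mu) / w              # working response
    sw = np.sqrt(w)
    Q, R = np.linalg.qr(X * sw[:, None])
    return R, Q.T @ (sw * z)

def fit_logistic(parts, p, n_iter=25, tol=1e-10):
    """Run by the aggregator: repeat weighted least squares until beta settles."""
    beta = np.zeros(p)
    for _ in range(n_iter):
        summaries = [node_step(X, y, beta) for X, y in parts]
        R_stack = np.vstack([R for R, _ in summaries])
        z_stack = np.concatenate([z for _, z in summaries])
        Q, R = np.linalg.qr(R_stack)
        beta_new = np.linalg.solve(R, Q.T @ z_stack)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated binary data (hypothetical), split across four nodes.
rng = np.random.default_rng(1)
n = 600
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
prob = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0, -1.0]))))
y = rng.binomial(1, prob)

parts = [(X[idx], y[idx]) for idx in np.array_split(np.arange(n), 4)]
beta_distributed = fit_logistic(parts, p=3)
beta_central = fit_logistic([(X, y)], p=3)   # same IRLS on the pooled data
```

In this sketch the raw rows never leave a node; only a small triangular factor and a length-p vector are exchanged per iteration.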

Paper Structure

This paper contains 20 sections, 1 theorem, 30 equations, 7 figures, 6 tables, 2 algorithms.

Key Result

Proposition A1

It is not possible to recover matrix $\mathbf{X}$ from matrix $\mathbf{R}$ alone in the QR decomposition.
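
One way to build intuition for this statement (an illustration, not the paper's proof) is that $\mathbf{R}$ depends on $\mathbf{X}$ only through $\mathbf{X}^\top\mathbf{X}$, so any orthogonal transformation $\mathbf{P}\mathbf{X}$ yields the same triangular factor. The sketch below checks this numerically; the helper `r_factor` and the random matrices are hypothetical.

```python
import numpy as np

def r_factor(X):
    """R from the QR decomposition of X, with signs fixed so diag(R) > 0."""
    R = np.linalg.qr(X, mode="r")
    signs = np.sign(np.diag(R))
    return R * signs[:, None]

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))

# A random orthogonal matrix P, obtained from the QR of a random square matrix.
P, _ = np.linalg.qr(rng.normal(size=(50, 50)))
X_other = P @ X                          # a different data matrix

same_R = np.allclose(r_factor(X), r_factor(X_other))
different_X = not np.allclose(X, X_other)
```

Both flags evaluate to true here: two distinct data matrices share one $\mathbf{R}$, so $\mathbf{R}$ by itself cannot identify which matrix produced it.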

Figures (7)

  • Figure A1: Absolute difference between the R lm function and the distributed LM algorithm, with the number of observations set to 100, 1000, or 10000 and the number of predictors set to 1, 3, 5, or 10, across 100 replicates.
  • Figure A2: Absolute difference between the R glm function and the distributed GLM algorithm, with the number of observations set to 100, 1000, or 10000 and the number of predictors set to 1, 3, 5, or 10, across 100 replicates.
  • Figure A3: Hexagonal-bin count plot of price against carat, with the number of bins set to 50.
  • Figure A4: Mosaic plot of the relation between credit card application acceptance and self-employment.
  • Figure A5: Mean absolute difference between the R lm function and the distributed LM algorithm, with the number of observations set to 10000 or 100000 and the number of predictors set to 1, 3, or 5, across 100 replicates, for 10 to 100 virtual nodes in increments of 5.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Proposition A1
  • Proof