Decentralized Kernel Ridge Regression Based on Data-Dependent Random Feature

Ruikai Yang, Fan He, Mingzhen He, Jie Yang, Xiaolin Huang

TL;DR

This paper tackles decentralized kernel ridge regression under data heterogeneity by introducing data-dependent random features (DDRF) that allow each node to use its own RFs while pursuing consensus on decision functions rather than on feature coefficients. The authors formulate a relaxed, convex objective that couples neighboring nodes via alignment penalties and derive an efficient primal-only update of the node-level coefficients that relies on precomputed local matrices and incurs the same communication costs as competing methods. They prove convergence under a condition on the self-penalty and validate the approach with experiments on six real-world datasets, showing substantial accuracy gains (e.g., average improvements of 25.5% over data-independent baselines), especially when data distributions differ across nodes and when feature budgets vary. The method, DeKRR-DDRF, offers a flexible, privacy-preserving, communication-efficient framework for decentralized kernel learning that adapts to node-specific data while maintaining network-wide performance gains.

Abstract

Random features (RFs) have been widely used to achieve node consistency in decentralized kernel ridge regression (KRR). Currently, consistency is guaranteed by imposing constraints on the feature coefficients, which requires the random features on different nodes to be identical. However, in many applications, the data on different nodes vary significantly in number or distribution, which calls for adaptive, data-dependent methods that generate different RFs. To tackle this essential difficulty, we propose a new decentralized KRR algorithm that pursues consensus on decision functions, which allows great flexibility and adapts well to the data on each node. Convergence is rigorously established and effectiveness is numerically verified: by capturing the characteristics of the data on each node while maintaining the same communication costs as other methods, we achieve an average regression accuracy improvement of 25.5% across six real-world data sets.
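
To make the distinction between coefficient consensus and decision-function consensus concrete, the following is a minimal sketch, not the paper's DeKRR-DDRF algorithm. It assumes plain (data-independent) random Fourier features for a Gaussian kernel and two hypothetical nodes that fit ridge regression independently with different feature maps. All names (`make_rff`, `fit_ridge`, `phi_a`, `phi_b`), the synthetic data, and the hyperparameters are illustrative assumptions. The coefficient vectors have different lengths and cannot be constrained to be equal, yet the two decision functions can still be compared on common inputs.

```python
import numpy as np

def make_rff(dim, num_features, sigma, rng):
    """Return a random Fourier feature map approximating a Gaussian kernel."""
    W = rng.normal(scale=1.0 / sigma, size=(dim, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return lambda X: np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def fit_ridge(Z, y, lam):
    """Solve the ridge-regression normal equations in feature space."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (an assumption for illustration only).
X = rng.uniform(-3.0, 3.0, size=(400, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=400)

# Two hypothetical nodes: different local data, different feature maps,
# and even different numbers of features.
phi_a = make_rff(dim=1, num_features=200, sigma=1.0, rng=rng)
phi_b = make_rff(dim=1, num_features=300, sigma=1.0, rng=rng)
theta_a = fit_ridge(phi_a(X[:200]), y[:200], lam=1e-3)
theta_b = fit_ridge(phi_b(X[200:]), y[200:], lam=1e-3)

# The coefficients live in different spaces (200 vs. 300 entries),
# so "theta_a == theta_b" is not a meaningful constraint.
# The decision functions, however, are comparable on shared inputs.
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
f_a = phi_a(X_test) @ theta_a
f_b = phi_b(X_test) @ theta_b
gap = np.max(np.abs(f_a - f_b))
print(theta_a.shape, theta_b.shape, gap)
```

In this sketch the quantity `gap` measures disagreement between the two nodes' decision functions; a function-level consensus scheme would penalize a quantity of this kind rather than the difference between coefficient vectors.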

Paper Structure (12 sections, 1 theorem, 39 equations, 4 figures, 2 tables, 1 algorithm)

Key Result

Proposition 1

Supposing that on each node, the objective value of (equ: objectiveFunction) is decreasing, i.e.,

Figures (4)

  • Figure 1: The RSE (mean$\pm$std) on six test sets versus the average number of features used by each node ($\bar{D}$) in the first non-IID data setting (different $\overline{|y_{j,i}|}$). $J=10$, $|\mathcal{N}_j|=4$, and ten nodes contain the same amount of data $N_j$ and the same number of features $D_j$.
  • Figure 2: The RSE (mean$\pm$std) on six test sets versus the average number of features used by each node ($\bar{D}$) in the second non-IID data setting (different $\overline{\|\boldsymbol{x}_{j,i}\|_2}$). $J=10$, $|\mathcal{N}_j|=4$, and ten nodes contain the same amount of data $N_j$ and the same number of features $D_j$.
  • Figure 3: The RSE (mean$\pm$std) on the Twitter data set versus the average number of features used by each node ($\bar{D}$) in the imbalanced data setting, where the $j$th node holds $N_j = \frac{2j-1}{100}N$ data points $(\sum_{j=1}^J N_j = N)$. $J=10$, $|\mathcal{N}_j|=4$, $\lambda = 10^{-6}$, and $\sigma=4$. Set $D_{j,\mathrm{Ours}}=\bar{D}$ for the equal $D_j$ setting and $D_{j,\mathrm{Ours}}=\sqrt{N_j}J\bar{D}/\sum_{j=1}^J\sqrt{N_j}$ for the different $D_j$ setting.
  • Figure 4: Each node's RSE on the Twitter data set in the imbalanced data setting when $\bar{D}=100$. $J=10$, $|\mathcal{N}_j|=4$, $\lambda = 10^{-6}$, $\sigma=4$, and $\sum_{j=1}^J N_j = N$. The number of data points increases gradually from node 1 to node 10, following $N_j=\frac{2j-1}{100} N$. Set $D_{j,\mathrm{Ours}}=\bar{D}$ for the equal $D_j$ setting and $D_{j,\mathrm{Ours}}=\sqrt{N_j}J\bar{D}/\sum_{j=1}^J\sqrt{N_j}$ for the different $D_j$ setting. By selecting the features and adjusting $D_j$ for each node, the accuracy of the nodes with more data $(j=6, 7, \ldots, 10)$ is further improved, while the total number of features used by the network remains unchanged.
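
The data split and feature allocation described in the captions of Figures 3 and 4 can be worked through numerically. The sketch below assumes the square-root term is read as $\sqrt{N_j}$, picks a hypothetical total sample count `N = 10000` (the captions do not state $N$), and leaves the budgets real-valued because the captions do not specify a rounding rule. The function names are illustrative.

```python
import math

def node_sizes(N, J=10):
    """N_j = (2j - 1) / 100 * N for j = 1..J; the fractions sum to 1 when J = 10."""
    return [(2 * j - 1) / 100 * N for j in range(1, J + 1)]

def feature_budgets(sizes, D_bar):
    """D_j = sqrt(N_j) * J * D_bar / sum_k sqrt(N_k), before any rounding."""
    J = len(sizes)
    roots = [math.sqrt(n) for n in sizes]
    total = sum(roots)
    return [r * J * D_bar / total for r in roots]

N = 10000      # hypothetical total number of samples
D_bar = 100    # average number of features per node, as in Figure 4

sizes = node_sizes(N)
budgets = feature_budgets(sizes, D_bar)

for j, (n, d) in enumerate(zip(sizes, budgets), start=1):
    print(f"node {j:2d}: N_j = {n:7.1f}, D_j = {d:6.1f}")

# The network-wide feature total is unchanged: sum_j D_j = J * D_bar.
print(sum(budgets))
```

Because the budgets are proportional to $\sqrt{N_j}$ and normalized by $\sum_j \sqrt{N_j}$, nodes with more data receive more features while the total stays at $J\bar{D}$, which matches the caption's remark that the total number of features used by the network remains unchanged.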

Theorems & Definitions (1)

  • Proposition 1