Data Collaboration Analysis with Orthonormal Basis Selection and Alignment

Keiyu Nosaka; Yamato Suetake; Yuichi Takano; Akiko Yoshise

TL;DR

This work identifies a practical instability in existing Data Collaboration (DC) basis alignment, caused by the choice of target basis, and introduces Orthonormal Data Collaboration (ODC), which enforces orthonormal secret and target bases. By reducing alignment to the Orthogonal Procrustes Problem, ODC obtains a closed-form solution and achieves orthogonal concordance, so downstream performance is invariant to the target basis. The approach yields computational speedups of up to $100\times$ in the reported evaluations and preserves DC's one-shot communication and semi-honest privacy model, with robust performance across a variety of tasks and anchor constructions. Empirical results demonstrate ODC's speed, stability, and favorable privacy-utility trade-offs relative to Imakura-DC, Kawakami-DC, differential privacy baselines, and federated learning. The work also offers practical deployment guidance, including anchor design strategies and governance considerations for cross-sector collaborations.

Abstract

Data Collaboration (DC) enables multiple parties to jointly train a model by sharing only linear projections of their private datasets. The core challenge in DC is to align the bases of these projections without revealing each party's secret basis. While existing theory suggests that any target basis spanning the common subspace should suffice, in practice, the choice of basis can substantially affect both accuracy and numerical stability. We introduce Orthonormal Data Collaboration (ODC), which enforces orthonormal secret and target bases, thereby reducing alignment to the classical Orthogonal Procrustes problem, which admits a closed-form solution. We prove that the resulting change-of-basis matrices achieve orthogonal concordance, aligning all parties' representations up to a shared orthogonal transform and rendering downstream performance invariant to the target basis. Computationally, ODC reduces the alignment complexity from $O(\min\{a(c\ell)^2, a^2 c\})$ to $O(ac\ell^2)$, and empirical evaluations show up to $100\times$ speed-ups with equal or better accuracy across benchmarks. ODC preserves DC's one-round communication pattern and privacy assumptions, providing a simple and efficient drop-in improvement to existing DC pipelines.
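
The closed form mentioned in the abstract follows from the standard Orthogonal Procrustes argument, sketched here as a worked equation. The symbols are borrowed from the Figure 1 caption ($\bm{A}_i$ for user $i$'s anchor representation, $\bm{O}$ for an arbitrary orthogonal matrix), and taking $\bm{A}_1 \bm{O}$ as the alignment target is an assumption read off the SVD stated there. This is the textbook derivation, not a reproduction of the paper's proof.

$$
\bm{G}_i \in \arg\min_{\bm{G} \in \mathcal{O}(\ell)} \|\bm{A}_i \bm{G} - \bm{A}_1 \bm{O}\|_F^2,
\qquad
\|\bm{A}_i \bm{G} - \bm{A}_1 \bm{O}\|_F^2
= \|\bm{A}_i\|_F^2 + \|\bm{A}_1\|_F^2 - 2\,\mathrm{tr}\!\left(\bm{G}^\top \bm{A}_i^\top \bm{A}_1 \bm{O}\right).
$$

$$
\bm{A}_i^\top \bm{A}_1 \bm{O} = \bm{U}_i \bm{\Sigma}_i \bm{V}_i^\top
\;\Longrightarrow\;
\mathrm{tr}\!\left(\bm{G}^\top \bm{U}_i \bm{\Sigma}_i \bm{V}_i^\top\right)
= \mathrm{tr}\!\left(\bm{Z} \bm{\Sigma}_i\right)
\le \mathrm{tr}\!\left(\bm{\Sigma}_i\right),
\qquad
\bm{Z} := \bm{V}_i^\top \bm{G}^\top \bm{U}_i \in \mathcal{O}(\ell).
$$

The first two norm terms do not depend on $\bm{G}$ because orthogonal matrices preserve the Frobenius norm, so minimizing the distance is the same as maximizing the trace term. The bound holds because the diagonal entries of an orthogonal $\bm{Z}$ are at most 1, and it is attained at $\bm{Z} = \bm{I}$, which gives $\bm{G}_i = \bm{U}_i \bm{V}_i^\top$.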

Paper Structure (50 sections, 4 theorems, 82 equations, 8 figures, 20 tables, 4 algorithms)

Introduction
Our Contributions
Notations and Organization
Notations
Organization
Preliminaries
The Data Collaboration Algorithm
Privacy Analysis
Standard semi-honest model
Remarks on collusion model
Communication Overhead
Related Works
Procrustes methods in multi-view alignment
Privacy-Preserving Machine Learning
Existing Basis Alignment
...and 35 more sections

Key Result

Theorem 2.1

Privacy Against Semi-Honest Users (adapted from Theorem 1 in [DCprivacy]). Any semi-honest user $i$ in the DC framework cannot infer the private dataset $\bm{X}_j$ of any other user $j \neq i$.

Figures (8)

Figure 1: Conceptual illustration of the Orthonormal Data Collaboration (ODC) framework. Each participating user independently projects their private dataset $\bm{X}_i \in \mathbb{R}^{n_i \times m}$ and a common anchor dataset $\bm{A} \in \mathbb{R}^{a \times m}$ into intermediate representations $\tilde{\bm{X}}_i = \bm{X}_i \bm{F}_i$ and $\bm{A}_i = \bm{A}\bm{F}_i$, respectively, using a privately selected orthonormal secret basis $\bm{F}_i \in \mathbb{R}^{m \times \ell}$. To collaboratively train machine learning models without revealing their private raw data, an analyst constructs orthogonal change-of-basis matrices $\bm{G}_i \in \mathcal{O}(\ell):= \{\bm{O} \in \mathbb{R}^{\ell \times \ell}: \bm{O}^\top \bm{O} = \bm{O}\bm{O}^\top = \bm{I}\}$ to align these representations onto a shared orthonormal target basis, without directly accessing the private secret bases. The ODC framework ensures these matrices achieve orthogonal concordance, aligning all user representations up to a common orthogonal transformation. Consequently, the analyst can safely aggregate and analyze the aligned representations $\tilde{\bm{X}}_i \bm{G}_i$ to perform downstream machine learning tasks. ODC explicitly addresses a key practical question: "How can we use only the intermediate anchor representations $\bm{A}_i$ to generate alignment matrices $\bm{G}_i$ without explicitly knowing the secret bases $\bm{F}_i$?" The analytical solution, illustrated in the figure, is $\bm{G}_i = \bm{U}_i \bm{V}_i^\top$, computed via the singular value decomposition $\bm{A}_i^\top \bm{A}_1 \bm{O} = \bm{U}_i \bm{\Sigma}_i \bm{V}_i^\top$, where $\bm{O} \in \mathcal{O}(\ell)$ is an arbitrarily selected orthogonal matrix.
Figure 2: Visual privacy verification using CelebA [celeba]. Original images (panel (a)) compared to images after orthonormal projections (panel (b)) and random non-orthogonal projections (panel (c)). Both transformations strongly obfuscate the visual content, illustrating that the orthonormality assumption does not compromise visual privacy relative to general projections.
Figure 3: Sensitivity curves for the break-even FL rounds $R^*$ (example settings: $a=10^{3}$, $m=784$, $\ell=100$, $\gamma=c$).
Figure 4: Heatmaps of the threshold number of FL rounds $R^*$ for different participation rates $p$. Brighter regions correspond to larger $R^*$, i.e., more FL rounds are required before DC becomes the more communication-efficient option. The white line marks the break-even contour $R^* = 1$; in regions above this line, a single FL round already costs more than DC.
Figure 5: Absolute communication volume versus quantization bit-width $q$ for the healthcare example ($\gamma=c$).
...and 3 more figures
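
The alignment step described in the Figure 1 caption can be tried numerically with a small NumPy sketch. It assumes toy dimensions, a random Gaussian anchor, and secret bases that all span the same subspace (constructed here as $\bm{F}_i = \bm{F}_1 \bm{Q}_i$ with orthogonal $\bm{Q}_i$). The helper names `random_orthonormal` and `procrustes_alignment` are hypothetical and do not come from the paper's code.

```python
import numpy as np

def random_orthonormal(rows, cols, rng):
    """Return a rows x cols matrix with orthonormal columns (reduced QR)."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q

def procrustes_alignment(anchor_i, target):
    """Closed-form minimizer of ||anchor_i @ G - target||_F over orthogonal G."""
    u, _, vt = np.linalg.svd(anchor_i.T @ target)
    return u @ vt  # G = U V^T

rng = np.random.default_rng(0)
a, m, ell, c = 200, 30, 5, 3  # toy sizes: anchor rows, features, reduced dim, users

A = rng.standard_normal((a, m))        # shared anchor dataset
F1 = random_orthonormal(m, ell, rng)   # user 1's secret orthonormal basis
# Assumption: every secret basis spans the same subspace, F_i = F_1 Q_i.
Q = [np.eye(ell)] + [random_orthonormal(ell, ell, rng) for _ in range(c - 1)]
F = [F1 @ q for q in Q]
O = random_orthonormal(ell, ell, rng)  # arbitrary orthogonal matrix fixing the target

anchors = [A @ f for f in F]           # A_i = A F_i, all the analyst sees
target = anchors[0] @ O                # A_1 O
G = [procrustes_alignment(A_i, target) for A_i in anchors]

# Orthogonal concordance: F_i G_i is identical for every user,
# even though the analyst never accessed any F_i.
gap = max(np.abs(F_i @ G_i - F[0] @ G[0]).max() for F_i, G_i in zip(F, G))
print("max deviation between aligned bases:", gap)
```

Under the shared-subspace assumption, every aligned basis $\bm{F}_i \bm{G}_i$ collapses to $\bm{F}_1 \bm{O}$, so changing $\bm{O}$ only rotates all users' representations together. When the bases span different subspaces, the same formula still returns the best orthogonal fit on the anchor, but exact agreement should not be expected.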

Theorems & Definitions (10)

Theorem 2.1
proof
Theorem 2.2
proof
Definition 4.2
Theorem 4.3
proof
Definition 5.1
Theorem 5.2
proof
