Orthogonal Procrustes problem preserves correlations in synthetic data

Oussama Ounissi, Nicklas Jävergård, Adrian Muntean

TL;DR

This work addresses preserving inter-feature Pearson correlations in synthetic tabular data without overhauling existing generation methods. It shows that correlation preservation can be achieved through a mean-centered Orthogonal Procrustes transformation with a diagonal scaling, yielding a closest synthetic matrix $\hat{S}$ that matches the original correlation matrix while controlling means and variances. The key contributions are a formal characterization, $\mathrm{Corr}(O)=\mathrm{Corr}(S)$ if and only if $M\bar{O}N=\bar{S}$ for an orthogonal matrix $M$ and a diagonal matrix $N$, and a practical construction $\hat{S}=M\bar{O}N+T$, with $M$ derived from the SVD of $\bar{S}(\bar{O}N)^T$, that replicates the correlations exactly with minimal impact on the marginals. An empirical demonstration on a large energy-consumption dataset from Madeira shows that the method can realign the correlations of a naive synthetic set to match the original while largely preserving feature distributions, which makes it a lightweight, effective post-processing step for synthetic data pipelines.

Abstract

This work introduces the application of the Orthogonal Procrustes problem to the generation of synthetic data. The proposed methodology ensures that the resulting synthetic data preserves important statistical relationships among features, specifically the Pearson correlation. An empirical illustration using a large, real-world, tabular dataset of energy consumption demonstrates the effectiveness of the approach and highlights its potential for application in practical synthetic data generation. Our approach is not meant to replace existing generative models, but rather to serve as a lightweight post-processing step that enforces the exact Pearson correlations of the original data on an already generated synthetic dataset.
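The post-processing step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `procrustes_correlation_fix` is hypothetical, $O$ and $S$ are assumed to have the same number of rows, $N$ is chosen here so that the result keeps the column standard deviations of $S$, and $T$ restores the column means of $S$. The toy data and the column-shuffling used to build a naive $S$ are also only for illustration.

```python
import numpy as np


def procrustes_correlation_fix(O, S):
    """Return S_hat = M @ O_bar @ N + T (hypothetical helper, see assumptions).

    O, S: arrays of shape (n_rows, n_features) with the same shape.
    """
    # Mean-center both datasets column by column.
    O_bar = O - O.mean(axis=0)
    S_bar = S - S.mean(axis=0)

    # Assumed choice of the diagonal scaling N: rescale each column of
    # O_bar to the standard deviation of the matching column of S.
    N = np.diag(S_bar.std(axis=0) / O_bar.std(axis=0))
    A = O_bar @ N

    # Orthogonal Procrustes: the orthogonal M minimizing ||M A - S_bar||_F
    # is U @ Vt, where U, Vt come from the SVD of S_bar @ A.T.
    # Note: this forms an n_rows x n_rows matrix, so it only suits small n_rows.
    U, _, Vt = np.linalg.svd(S_bar @ A.T)
    M = U @ Vt

    # Assumed choice of the translation T: restore the column means of S.
    return M @ A + S.mean(axis=0)


# Toy data (illustrative only): correlated "original" features.
rng = np.random.default_rng(0)
n, m = 200, 4
O = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))

# Naive synthetic set: shuffle each column independently, which keeps the
# marginals of O but destroys the correlations between features.
S = np.column_stack([rng.permutation(O[:, j]) for j in range(m)])

S_hat = procrustes_correlation_fix(O, S)

corr_O = np.corrcoef(O, rowvar=False)
corr_S_hat = np.corrcoef(S_hat, rowvar=False)
print("max |Corr(S_hat) - Corr(O)|:", np.abs(corr_S_hat - corr_O).max())
```

Forming $\bar{S}(\bar{O}N)^T$ explicitly costs memory quadratic in the number of rows, so a dataset of realistic size would need a reduced formulation; the sketch keeps the direct form only because it mirrors the construction as stated.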

Paper Structure

This paper contains 7 sections, 4 theorems, 4 equations, 2 figures.

Key Result

Lemma 1

With the previous setting, it follows that $\hbox{S}_{\hbox{c}}(O) = \hbox{S}_{\hbox{c}}(S)$ if and only if there exist an orthogonal matrix $M \in \mathbb{R}^{p\times n}$, and a diagonal matrix $N \in \mathbb{R}^{m\times m}$, such that $M O N = S$.
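
The "if" direction can be checked in one line. As a sketch only, assume that $\hbox{S}_{\hbox{c}}$ denotes the Pearson correlation matrix, that $O$ and $MON$ are both column-centered, that $M^\top M = I$, and that $N$ has positive diagonal entries $N_{ii}$; these assumptions are made here for illustration, since the lemma's precise setting is not reproduced in this summary. Writing $G = O^\top O$:

$$(MON)^\top (MON) = N\,O^\top M^\top M\,O\,N = NGN, \qquad \frac{(NGN)_{ij}}{\sqrt{(NGN)_{ii}\,(NGN)_{jj}}} = \frac{N_{ii}N_{jj}\,G_{ij}}{\sqrt{N_{ii}^2 G_{ii}\,N_{jj}^2 G_{jj}}} = \frac{G_{ij}}{\sqrt{G_{ii}\,G_{jj}}}.$$

Under these assumptions the normalized entries of the Gram matrix, which are the Pearson correlations, are the same for $MON$ and for $O$.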

Figures (2)

  • Figure 1: The Pearson correlation matrices of the original dataset and three datasets generated from it. From top to bottom, left to right: $O$, the original; $\hat{O}$, the SVD construction formula applied to $O$; $S$, generated from the individual empirical distributions of $O$; $\hat{S}$, the result of applying the SVD construction formula to $S$.
  • Figure 2: Comparison of the distributions of the 5 features in each dataset. From top to bottom: current $I$, voltage $V$, power $P$, power factor $PF$, and reactive power $Q$. Three lines nearly coincide: $O$ (red solid line), $S$ (dashed blue line), and $\hat{O}$ (green dotted line). The dashed yellow line shows $\hat{S}$, the final version of our synthetic dataset, which has the same correlations as the original $O$, as shown in Figure 1.

Theorems & Definitions (6)

  • Lemma 1
  • Remark 1
  • Lemma 2
  • Lemma 2
  • Theorem 1
  • Remark 2