Orthogonal Procrustes problem preserves correlations in synthetic data
Oussama Ounissi, Nicklas Jävergård, Adrian Muntean
TL;DR
This work addresses the preservation of inter-feature Pearson correlations in synthetic tabular data without overhauling existing generation methods. It shows that correlations can be preserved through a mean-centered Orthogonal Procrustes transformation combined with a diagonal scaling, yielding the synthetic matrix $\hat{S}$ closest to the given synthetic data that matches the original correlation matrix while controlling means and variances. The key contributions are a formal characterization of $\mathrm{Corr}(O)=\mathrm{Corr}(S)$ via the relation $M\bar{O}N=\bar{S}$, and a practical construction $\hat{S}=M\bar{O}N+T$ in which $M$ is derived from the SVD of $\bar{S}(\bar{O}N)^T$, enabling exact correlation replication with minimal impact on the marginals. An empirical demonstration on a large energy-consumption dataset from Madeira shows that the method realigns the correlations of a naive synthetic set to match those of the original while largely preserving the feature distributions, which makes it a lightweight post-processing step for synthetic data pipelines.
Abstract
This work applies the Orthogonal Procrustes problem to the generation of synthetic data. The proposed methodology ensures that the resulting synthetic data preserves important statistical relationships among features, specifically the Pearson correlation. An empirical illustration on a large, real-world tabular dataset of energy consumption demonstrates the effectiveness of the approach and highlights its potential for practical synthetic data generation. Our approach is not meant to replace existing generative models; rather, it serves as a lightweight post-processing step that enforces the exact Pearson correlations of the original data on an already generated synthetic dataset.
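The post-processing step summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `align_correlations`, the toy data, the choice of $N$ as the ratio of column standard deviations, and the choice of $T$ as the column means of $S$ are all hypothetical. The sketch also assumes that $O$ and $S$ have the same shape and full column rank after centering.

```python
import numpy as np

def align_correlations(O, S):
    """Sketch of S_hat = M @ O_bar @ N + T (N and T are assumed choices)."""
    O_bar = O - O.mean(axis=0)          # mean-centered original data
    S_bar = S - S.mean(axis=0)          # mean-centered synthetic data
    # Assumption: N rescales each original column to the synthetic column's std.
    N = np.diag(S_bar.std(axis=0) / O_bar.std(axis=0))
    A = O_bar @ N
    # Orthogonal Procrustes: M = U V^T minimizes ||M A - S_bar||_F over orthogonal M,
    # where U and V come from the SVD of S_bar (O_bar N)^T.
    U, _, Vt = np.linalg.svd(S_bar @ A.T)
    M = U @ Vt
    # Assumption: T restores the column means of the synthetic data.
    T = np.tile(S.mean(axis=0), (O.shape[0], 1))
    return M @ A + T

# Toy data (hypothetical): correlated "original" features.
rng = np.random.default_rng(0)
n, d = 200, 3
mixing = np.array([[1.0, 0.8, 0.3],
                   [0.0, 0.6, 0.5],
                   [0.0, 0.0, 0.8]])
O = rng.normal(size=(n, d)) @ mixing
# Naive synthetic set: shuffle each column independently, which keeps the
# marginals but destroys the inter-feature correlations.
S = np.column_stack([rng.permutation(O[:, j]) for j in range(d)])

S_hat = align_correlations(O, S)

corr_O = np.corrcoef(O, rowvar=False)
corr_gap_before = np.abs(np.corrcoef(S, rowvar=False) - corr_O).max()
corr_gap_after = np.abs(np.corrcoef(S_hat, rowvar=False) - corr_O).max()
print(f"max correlation gap before: {corr_gap_before:.3f}")
print(f"max correlation gap after:  {corr_gap_after:.2e}")
```

Because $M$ is orthogonal, $M\bar{O}N$ has the same Gram matrix as $\bar{O}N$, and a positive diagonal $N$ rescales variances without changing correlations. Note that this sketch forms an $n \times n$ matrix before the SVD, which is only practical for small $n$; a dataset as large as the one in the empirical study would likely need a low-rank formulation.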
