Redistributor: Transforming Empirical Data Distributions

Pavol Harar, Dennis Elbrächter, Monika Dörfler, Kory D. Johnson

TL;DR

Redistributor addresses the problem of transforming one empirical distribution into another, or into a known target, by composing the target CDF's inverse with the source CDF ($R=F_T^{-1}\circ F_S$). It provides practical estimators (KDE-based and linearly interpolated eCDFs), explicit handling of duplicates and boundaries, and an efficient Python/scikit-learn implementation, together with a theoretical framework based on Hadamard differentiability that guarantees consistency and asymptotic normality. The paper demonstrates applications in image processing (color correction, photorealistic style transfer, photomosaics), data augmentation, and ML preprocessing, and reports favorable comparisons with model-based methods and neural approaches in terms of content fidelity and computational efficiency. Overall, Redistributor offers a principled, scalable, and interpretable tool for distribution matching across vision, signal processing, and machine learning pipelines.

Abstract

We present an algorithm and package, Redistributor, which forces a collection of scalar samples to follow a desired distribution. When given independent and identically distributed samples of some random variable $S$ and the continuous cumulative distribution function of some desired target $T$, it provably produces a consistent estimator of the transformation $R$ which satisfies $R(S)=T$ in distribution. As the distribution of $S$ or $T$ may be unknown, we also include algorithms for efficiently estimating these distributions from samples. This allows for various interesting use cases in image processing, where Redistributor serves as a remarkably simple and easy-to-use tool that is capable of producing visually appealing results. For color correction it outperforms other model-based methods and excels in achieving photorealistic style transfer, surpassing deep learning methods in content preservation. The package is implemented in Python and is optimized to efficiently handle large datasets, making it also suitable as a preprocessing step in machine learning. The source code is available at https://github.com/paloha/redistributor.
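The transformation $R=F_T^{-1}\circ F_S$ described above can be illustrated with a minimal NumPy sketch. This is not the Redistributor package's API: the helper names `fit_ecdf` and `redistribute` are hypothetical, both distributions are estimated from samples with a plain linearly interpolated eCDF, and the sketch assumes continuous data without duplicate values.

```python
import numpy as np

def fit_ecdf(samples):
    """Hypothetical helper: sorted samples and probabilities strictly inside (0, 1)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    p = np.arange(1, n + 1) / (n + 1)  # k / (n + 1) avoids hitting exactly 0 or 1
    return x, p

def redistribute(values, source_samples, target_samples):
    """Apply R = F_T^{-1} o F_S, with both CDFs estimated by linear interpolation."""
    xs, ps = fit_ecdf(source_samples)
    xt, pt = fit_ecdf(target_samples)
    u = np.interp(values, xs, ps)  # estimated F_S: data values -> probabilities
    return np.interp(u, pt, xt)    # estimated F_T^{-1}: probabilities -> target values

rng = np.random.default_rng(0)
source = rng.exponential(size=5000)                 # skewed source samples
target = rng.normal(loc=2.0, scale=0.5, size=5000)  # samples of the desired target
out = redistribute(source, source, target)

print(out.mean(), out.std())  # should be close to the target's mean and std
```

Because the map is monotone, the order of the source values is preserved while their empirical distribution is replaced by that of the target.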

Paper Structure (23 sections, 6 theorems, 48 equations, 13 figures)

This paper contains 23 sections, 6 theorems, 48 equations, 13 figures.

Key Result

Theorem 6.2

Let $(X,\|\cdot\|_X)$, $(Y,\|\cdot\|_Y)$ be Banach spaces, $D\subseteq X$, and let $\varphi\colon D\to Y$ be Hadamard differentiable at $F\in X$. Let $G$ and $F_n$, $n\in\mathbb{N}$, be $D$-valued random variables [...] If $\varphi'[F]$ is defined and continuous on the entirety of $X$, we also have [...]

Footnotes (truncated in the source): "We do not explicitly discuss the underlying probability spaces, but refer the interested [...]"; "For a definition of [...]".
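The displayed formulas of this statement are missing from the excerpt. For orientation only, the listing further down attributes Theorem 6.2 to van der Vaart (1998), Thm. 20.8, which is the functional delta method. Its textbook form, written in the symbols used above, is sketched here. The rate sequence $r_n$ and the weak-convergence arrow are assumptions taken from the textbook statement, and the paper's exact wording may differ.

```latex
% Textbook functional delta method (van der Vaart 1998, Thm. 20.8),
% transcribed into the notation of the excerpt; r_n is an assumed rate sequence.
\[
  r_n\,(F_n - F) \rightsquigarrow G, \quad r_n \to \infty
  \qquad\Longrightarrow\qquad
  r_n\bigl(\varphi(F_n) - \varphi(F)\bigr) \rightsquigarrow \varphi'[F](G).
\]
% If \varphi'[F] is defined and continuous on all of X, then additionally:
\[
  r_n\bigl(\varphi(F_n) - \varphi(F)\bigr) - \varphi'[F]\bigl(r_n (F_n - F)\bigr)
  \;\xrightarrow{\;P\;}\; 0 .
\]
```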

Figures (13)

  • Figure 1: Matching colors of a reference image: one of the use cases of Redistributor from \ref{sec:use_cases}.
  • Figure 2: Applying the transformation $R$ from $\hat{F}_S$ to $\hat{F}_T$, where $\hat{F}_S$ is an estimate of a Double Gamma distribution $F_S$ obtained by \ref{alg:L} from 1000 iid samples, and $\hat{F}_T$ is an estimate of a Gaussian distribution $F_T$ obtained by \ref{alg:L} from 1000 iid samples. Subfigures (a) and (c) display density histograms.
  • Figure 3: Treating boundary values in \ref{alg:L}: the simplest possible example, using only 3 data points. Each subfigure shows where the supported values of the respective functions map, depending on whether the boundaries are explicitly set or not. Endpoints denote the "valid" support, i.e. the interval where the function is strictly increasing. For example, if the boundary $a$ is not set, the CDF maps all values from the interval $[-\infty, \min]$ to the constant $\Delta = 1/(\mathrm{bins}+1)$, and the PPF maps values from the interval $[0, \Delta]$ to $\min$, i.e. the minimum value of the provided data. The same applies analogously to the boundary value $b$, which can be set or left unset independently of $a$.
  • Figure 4: Timing \ref{alg:L} and \ref{alg:KDEW} on a consumer-grade CPU (Intel® Core™ i7-8565U, 1.80 GHz). In both subfigures, N denotes the number of input data points. In (a), K denotes the number of bins and in (b), K denotes the grid density. Note that, compared to \ref{alg:KDEW}, \ref{alg:L} can handle approximately 4 orders of magnitude more data points in the same amount of time, making it applicable in larger data-processing pipelines.
  • Figure 5: Correcting exposure using a reference image.
  • ...and 8 more figures
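
The boundary behaviour described in the Figure 3 caption, for the case where neither boundary is set, can be reproduced with a small NumPy sketch. It is an illustration only and not the paper's algorithm: the three data values are made up, `bins` is assumed to equal the number of data points, and `cdf` and `ppf` are hypothetical helper names built on the clamping behaviour of `np.interp`.

```python
import numpy as np

# Three illustrative data points (made up for this sketch).
data = np.array([1.0, 2.0, 4.0])

bins = data.size               # assumption: one bin per data point
delta = 1.0 / (bins + 1)       # the constant Delta from the caption
probs = np.arange(1, bins + 1) * delta  # lookup-table probabilities

def cdf(x):
    """Linearly interpolated CDF; np.interp clamps inputs outside [min, max]."""
    return np.interp(x, data, probs)

def ppf(u):
    """Inverse lookup; probabilities outside [delta, 1 - delta] are clamped."""
    return np.interp(u, probs, data)

print(cdf(-10.0), ppf(0.1))  # below the minimum: CDF gives delta, PPF gives min
print(cdf(10.0), ppf(0.9))   # above the maximum: CDF gives 1 - delta, PPF gives max
```

With unset boundaries, everything below the smallest data value collapses onto $\Delta$, and every probability in $[0, \Delta]$ maps back to the minimum, which is the flat region the caption refers to.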

Theorems & Definitions (10)

  • Definition 6.1
  • Theorem 6.2 (van der Vaart, 1998, Thm. 20.8)
  • Lemma 6.3
  • proof
  • Theorem 6.4
  • Theorem 6.5 (van der Vaart, 1998, Thm. 18.11)
  • Lemma 6.6
  • proof
  • Proposition 6.7
  • proof