Divide and Predict: An Architecture for Input Space Partitioning and Enhanced Accuracy

Fenix W. Huang; Henning S. Mortveit; Christian M. Reidys

TL;DR

The authors establish that variance is maximal for equal mixes of distributions, and detail how variance-based data purification followed by conventional training over blocks can lead to significant increases in test accuracy.

Abstract

In this article the authors develop an intrinsic measure for quantifying heterogeneity in training data for supervised learning. This measure is the variance of a random variable which factors through the influences of pairs of training points. The variance is shown to capture data heterogeneity and can thus be used to assess whether a sample is a mixture of distributions. The authors prove that the data itself contains key information that supports a partitioning into blocks. Several proof-of-concept studies are provided that quantify the connection between variance and heterogeneity for EMNIST image data and synthetic data. The authors establish that variance is maximal for equal mixes of distributions, and detail how variance-based data purification followed by conventional training over blocks can lead to significant increases in test accuracy.
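As a rough intuition for why a variance can peak at an equal mix, the sketch below uses a toy analogue only: a scalar drawn from a two-component Gaussian mixture. It is not the paper's random variable $X$ (which is defined through influences of pairs of training points and is not reproduced here); the function name `mixture_variance` and the parameters `mu1`, `mu2`, `sigma` are hypothetical. By the law of total variance the mixture variance is $\sigma^2 + p(1-p)(\mu_1-\mu_2)^2$, which is largest at $p = 1/2$.

```python
import numpy as np

def mixture_variance(p, mu1=0.0, mu2=1.0, sigma=1.0):
    """Variance of a draw from N(mu1, sigma^2) w.p. 1-p, else N(mu2, sigma^2).

    Law of total variance: within-component part plus between-component part.
    Toy analogue only; this is NOT the influence-based variance of the paper.
    """
    return sigma**2 + p * (1.0 - p) * (mu1 - mu2) ** 2

# Sweep the composition of a 600-point sample: |A2| = 0, 50, ..., 600.
sizes = np.arange(0, 601, 50)
fractions = sizes / 600.0
variances = mixture_variance(fractions)

# Composition at which the toy variance is largest.
best = sizes[np.argmax(variances)]
print(best)
```

In this toy setting the maximum sits at the balanced composition, which mirrors the qualitative statement in the abstract without reproducing its proof.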

Paper Structure (11 sections, 10 theorems, 72 equations, 8 figures)

Introduction
Some preliminary facts
Main results
Applications
EMNIST image data
Synthetic data - two distributions
Synthetic data - three distributions
Discussion
Proofs
Proofs of basic facts
Proofs for main results

Key Result

Lemma 1

Let $f\colon \mathbb{R}^n\rightarrow \mathbb{R}$ be a convex, twice continuously differentiable function, and let $f_{2,\theta_0}$ denote its second-order Taylor approximation at $\theta_0$. Then $f_{2,\theta_0}$ assumes its minimum at
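The statement is cut off in this listing, so the paper's exact formula is not shown here. For orientation only, the sketch below checks a standard fact numerically: when the Hessian $H$ of $f$ at $\theta_0$ is positive definite, the quadratic model $f(\theta_0) + \nabla f(\theta_0)^\top(\theta-\theta_0) + \tfrac12(\theta-\theta_0)^\top H(\theta-\theta_0)$ is minimized at the Newton point $\theta_0 - H^{-1}\nabla f(\theta_0)$. Whether this coincides with the lemma's conclusion is an assumption; the test function `f` and all names below are hypothetical.

```python
import numpy as np

def quadratic_model_minimizer(grad, hess, theta0):
    """Minimizer of the second-order Taylor model at theta0.

    Assumes hess is positive definite (stronger than mere convexity).
    """
    return theta0 - np.linalg.solve(hess, grad)

# Hypothetical strictly convex test function: f(t) = sum(exp(t)) + 0.5 * ||t||^2.
def f(theta):
    return np.sum(np.exp(theta)) + 0.5 * theta @ theta

def grad_f(theta):
    return np.exp(theta) + theta

def hess_f(theta):
    return np.diag(np.exp(theta)) + np.eye(theta.size)

theta0 = np.array([0.5, -1.0])
g, H = grad_f(theta0), hess_f(theta0)
theta_star = quadratic_model_minimizer(g, H, theta0)

def taylor2(theta):
    """Second-order Taylor approximation of f around theta0."""
    d = theta - theta0
    return f(theta0) + g @ d + 0.5 * d @ H @ d

# The gradient of the quadratic model vanishes at theta_star.
model_grad_at_star = g + H @ (theta_star - theta0)
```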

Figures (8)

Figure 1: The two-stage approach involving purification followed by conventional machine learning. Prediction is done using a classifier routing input to the appropriate sub-model.
Figure 2: Outline: three classes of data each analyzed with respect to variance, heterogeneity, test accuracy and subjected to variance-based purification. SD-2 and SD-3 denote synthetic data containing two and three distinct distributions, respectively.
Figure 3: Variance $\mathbb{V}[X]$ (blue curve) and prediction accuracy (orange curve) for EMNIST data as a function of the error rate $r$ applied to training data. All 10 digits were used, with each digit $0$ through $9$ represented by $10^3$ images to obtain $|Z|= 10^4$. We train an MLP model which takes all $28 \times 28$ grayscale pixels as features, with learning rate $10^{-3}$ and batch size $64$, and use the trained model to predict on a disjoint test set of size $10^4$ sampled from EMNIST data. The horizontal axis gives the error rate, and the left (resp. right) vertical axis gives the variance $\mathbb{V}[X]$ (resp. test accuracy). Filled circles/triangles show expectations, while bars show standard deviations across the 5 replications generated for each error rate $r$. Note that the standard deviations of the test accuracy are small, so nearly all of its error bars are contained inside the triangle showing the expectation.
Figure 4: The evolution of training set variance $\mathbb{V}[X]$ (blue curve) and test accuracy (orange curve) over 20 purification iterations. Here a training set $T$ consists of 600 EMNIST samples of the digits '4' and '8' (evenly distributed) with an error rate $r=0.30$. An independent test set $S$ contains 600 correctly labeled EMNIST samples with the same digit distribution as $T$. Training was conducted using MLR with batch size $64$ and learning rate $10^{-3}$. For reference, we trained a model on a single-distribution training set $T_0$ of size $600$ (that is, $r=0$) and evaluated it on $S$, obtaining a test accuracy of $0.97$, shown as the dashed black line. The maximal accuracy reached during purification was $0.957$, which occurred when $220$ data points had been removed, marked by the red dashed line.
Figure 5: Variance $\mathbb{V}[X]$ and test accuracy as functions of the composition of mixed synthetic data (SD-2) with two distributions ($A_1$, $A_2$) and $|Z| = 600$, using MLR. The horizontal axis shows $|A_2|$, with $|A_1|=|Z| - |A_2|$. The left vertical axis shows test variance (blue curve) while the right axis shows the test accuracy (orange curve). $Z$ is randomly split into training and test sets in an 8:2 ratio. The test accuracy is computed by training the model on the training set and evaluating it on the test set. Filled circles/triangles give the expectations while error bars show standard deviations; for each composition $k=5$ replications were generated. For MLR training, the batch size was $64$ and the learning rate was $10^{-3}$.
...and 3 more figures
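The prediction step described in the Figure 1 caption (a classifier routing each input to the appropriate sub-model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-centroid router, the class name `NearestCentroidRouter`, the function `routed_predict`, and the toy sub-models are all hypothetical stand-ins for whatever router and per-block models are trained after purification.

```python
import numpy as np

class NearestCentroidRouter:
    """Hypothetical router: assigns each input to the block with the closest centroid."""

    def fit(self, X, block_ids):
        self.blocks_ = np.unique(block_ids)
        self.centroids_ = np.stack(
            [X[block_ids == b].mean(axis=0) for b in self.blocks_]
        )
        return self

    def predict(self, X):
        # Pairwise distances, shape (n_samples, n_blocks).
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.blocks_[np.argmin(d, axis=1)]

def routed_predict(router, sub_models, X):
    """Route each row of X to a block, then apply that block's sub-model."""
    blocks = router.predict(X)
    y = np.empty(len(X), dtype=int)
    for b in np.unique(blocks):
        mask = blocks == b
        y[mask] = sub_models[int(b)](X[mask])
    return y

# Toy data: two well-separated blocks whose labeling rules conflict.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-5.0, size=(50, 2))
X1 = rng.normal(loc=5.0, size=(50, 2))
X = np.vstack([X0, X1])
block_ids = np.array([0] * 50 + [1] * 50)

router = NearestCentroidRouter().fit(X, block_ids)
# Stand-ins for sub-models trained separately on each block.
sub_models = {
    0: lambda A: (A[:, 0] > -5.0).astype(int),
    1: lambda A: (A[:, 0] <= 5.0).astype(int),
}
y_hat = routed_predict(router, sub_models, X)
```

Keeping the router separate from the sub-models means each block can be trained with any conventional learner, and only the routing step needs to know about the partition.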

Theorems & Definitions (20)

Lemma 1
Lemma 2
Lemma 3
Theorem 1
Lemma 4
Lemma 5
Lemma 6
Lemma 7
Theorem 2
Corollary 1
...and 10 more
