SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets

Toby Dylan Hocking, Gabrielle Thibault, Cameron Scott Bodine, Paul Nelson Arellano, Alexander F Shenkin, Olivia Jasmine Lindly

TL;DR

This work proposes SOAK (Same/Other/All K-fold cross-validation), a new method for answering two questions: whether a model trained on the data gathered so far predicts accurately on a qualitatively different test subset, and whether data subsets are similar enough that combining them during model training is beneficial.

Abstract

In many real-world applications of machine learning, we want to know whether it is possible to train on the data gathered so far and obtain accurate predictions on a new test data subset that is qualitatively different in some respect (time period, geographic region, etc.). Another question is whether data subsets are similar enough that it is beneficial to combine them during model training. We propose SOAK, Same/Other/All K-fold cross-validation, a new method which can be used to answer both questions. SOAK systematically compares models that are trained on different subsets of data and then used for prediction on a fixed test subset, in order to estimate the similarity of learnable/predictable patterns in data subsets. We show results of using SOAK on 6 new real data sets (with geographic/temporal subsets, to check if predictions are accurate on new subsets), 3 image pair data sets (subsets are different image types, to check that we get smaller prediction error on similar images), and 11 benchmark data sets with predefined train/test splits (to check similarity of predefined splits).
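The comparison described in the abstract can be illustrated with a minimal sketch of how Same/Other/All train sets might be enumerated for each test subset and fold. The function name `soak_splits` and the toy `subset`/`fold` columns are hypothetical, and the sketch assumes that every train set excludes the current test fold, with Same, Other, and All differing only in which subsets contribute rows. It is not the authors' implementation.

```python
from typing import Iterator, List, Sequence, Tuple

Split = Tuple[str, int, str, List[int], List[int]]


def soak_splits(subset: Sequence[str], fold: Sequence[int]) -> Iterator[Split]:
    """Yield (test_subset, test_fold, train_kind, train_idx, test_idx).

    Assumption: train rows always come from folds other than the test fold;
    'same', 'other', and 'all' only change which subsets are used.
    """
    rows = range(len(subset))
    for test_subset in sorted(set(subset)):
        for test_fold in sorted(set(fold)):
            # Test set: rows in the current test subset AND the current fold.
            test_idx = [
                i for i in rows
                if subset[i] == test_subset and fold[i] == test_fold
            ]
            # Candidate train rows: everything outside the test fold.
            train_pool = [i for i in rows if fold[i] != test_fold]
            train_sets = {
                "same": [i for i in train_pool if subset[i] == test_subset],
                "other": [i for i in train_pool if subset[i] != test_subset],
                "all": train_pool,
            }
            for kind, train_idx in train_sets.items():
                yield test_subset, test_fold, kind, train_idx, test_idx


# Toy data (hypothetical): two subsets, three folds, one row per cell.
subset = ["A", "A", "A", "B", "B", "B"]
fold = [1, 2, 3, 1, 2, 3]
splits = {
    (s, f, kind): (train, test)
    for s, f, kind, train, test in soak_splits(subset, fold)
}
```

In this toy example, the entry for test subset B and fold 3 has a single test row, a Same train set drawn from subset B in folds 1 and 2, an Other train set drawn from subset A in folds 1 and 2, and an All train set that is their union. A model would then be fit on each train set and evaluated on the shared test set.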

Paper Structure

This paper contains 27 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: SOAK (Same/Other/All K-fold CV) requires adding subset and fold columns to the data (upper left). One iteration of SOAK train/test splits is shown (black box, lower right): the current test subset is B, so Same=B, Other=A, and All=A+B are the subset values used to define the train set, in combination with the current test fold, 3. The test sets shown therefore have subset=B and fold=3, the Same train set has subset=B and fold$\in\{1,2\}$, etc.
  • Figure 2: Test error (top) and training time (bottom) of five algorithms on four data sets.
  • Figure 3: SOAK was used to compute mean/SD of test error over 10 cross-validation folds, and p-values for differences (Other-Same and All-Same), in each of four data sets with two subsets (predefined train/test assignments in the data table). For data sets that have similar learnable/predictable patterns (left), training on All subsets has smaller test error than Same, and training on Other has either smaller or larger test error than Same (depending on the number of rows in the subset). For data sets that have different learnable/predictable patterns (right), training on All subsets never has smaller test error than Same, and training on Other always has larger test error than Same.
  • Figure 4: SOAK was used to compute mean test error differences (All-Same) and p-values for each test subset, over 10 cross-validation folds. Line segments and table show min/max values over 2 to 4 test subsets in each data set; dot shows mean. Horizontal black line separates data sets by the degree of differences in learnable/predictable patterns: top 10 for large differences (min/max ErrorDiff positive or zero: never beneficial to combine subsets when training) and bottom 10 for small differences (min/max ErrorDiff negative or zero: never detrimental to combine subsets).
  • Figure 5: SOAK was used to compute mean test error differences (Other-Same) and p-values for each test subset, over 10 cross-validation folds. Line segments and table show min/max values over 2 to 4 test subsets in each data set; dot shows mean. Horizontal black line separates data sets by the degree of differences in learnable/predictable patterns: top 10 for large differences (min/max ErrorDiff both positive: inaccurate prediction on all new subsets) and bottom 10 for small differences (min ErrorDiff negative, max positive: accurate prediction on at least one new subset).
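
The error differences summarized in Figures 3 to 5 can be sketched as a per-fold paired comparison. The captions do not say which statistical test produced the p-values, so the paired t statistic below is an assumption, and the function name `error_diff_summary` and the error values are made up for illustration only.

```python
import math
import statistics
from typing import Dict, Sequence


def error_diff_summary(
    err_same: Sequence[float], err_compare: Sequence[float]
) -> Dict[str, float]:
    """Summarize per-fold test error differences (compare minus same).

    Both inputs hold one test error per cross-validation fold, for the same
    test subset. A negative mean difference means the compared train set
    (Other or All) gave smaller test error than Same.
    """
    if len(err_same) != len(err_compare):
        raise ValueError("need one error per fold in both inputs")
    diffs = [c - s for s, c in zip(err_same, err_compare)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, needs at least 2 folds
    # Assumed test: paired t statistic with len(diffs) - 1 degrees of freedom.
    if sd_diff > 0:
        t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    else:
        t_stat = math.nan
    return {"mean_diff": mean_diff, "sd_diff": sd_diff, "t_stat": t_stat}


# Made-up per-fold test errors for one test subset (illustration only).
same_errors = [0.10, 0.12, 0.11]
all_errors = [0.08, 0.09, 0.10]
all_minus_same = error_diff_summary(same_errors, all_errors)
```

Converting the t statistic to a p-value requires the t distribution, for example via `scipy.stats.ttest_rel` on the two per-fold error vectors. Repeating the summary for each test subset and taking the min/max of `mean_diff` gives the kind of range that the line segments in Figures 4 and 5 display.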