Finite Sample Complexity Analysis of Binary Segmentation

Toby Dylan Hocking

TL;DR

This paper describes new methods for analyzing the time and space complexity of binary segmentation for a given finite data size and minimum segment length parameter. An empirical analysis of real data suggests that binary segmentation is often close to optimal speed in practice.

Abstract

Binary segmentation is the classic greedy algorithm which recursively splits a sequential data set by optimizing some loss or likelihood function. Binary segmentation is widely used for changepoint detection in data sets measured over space or time, and as a sub-routine for decision tree learning. In theory it is fast: for $N$ data and $K$ splits, it takes $O(N K)$ time in the worst case and $O(N \log K)$ time in the best case. In this paper we describe new methods for analyzing the time and space complexity of binary segmentation for a given finite $N$, $K$, and minimum segment length parameter. First, we describe algorithms that can be used to compute the best and worst case number of candidate splits the algorithm must consider. Second, we describe synthetic data that achieve the best and worst case, and which can be used to test for correct implementation of the algorithm. Finally, we provide an empirical analysis of real data which suggests that binary segmentation is often close to optimal speed in practice.
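To make the quantity being analyzed concrete, the following is a minimal sketch of binary segmentation that also counts how many candidate splits it computes. It is not the paper's implementation: the function name `binary_segmentation`, the choice of the square loss, the tie-breaking rule (smallest split index), and the convention of also scoring the two segments created by the final split are all assumptions made for illustration.

```python
import heapq
from itertools import accumulate


def binary_segmentation(data, max_splits, min_len=1):
    """Greedy binary segmentation with the square loss (illustrative sketch).

    Returns (sorted changepoint indices, number of candidate splits computed).
    A changepoint t is a segment boundary between data[t-1] and data[t].
    """
    # Cumulative sums give the square loss of any segment in O(1).
    cum = [0.0, *accumulate(data)]
    cum_sq = [0.0, *accumulate(x * x for x in data)]

    def loss(i, j):
        # Sum of squared deviations from the mean of data[i:j].
        total = cum[j] - cum[i]
        return (cum_sq[j] - cum_sq[i]) - total * total / (j - i)

    n_candidates = 0
    heap = []  # entries: (-loss_decrease, start, split, end)

    def consider(start, end):
        # Score every split of data[start:end] that respects min_len.
        nonlocal n_candidates
        candidates = range(start + min_len, end - min_len + 1)
        if len(candidates) == 0:
            return  # segment too short to split
        n_candidates += len(candidates)
        whole = loss(start, end)

        def decrease(t):
            return whole - loss(start, t) - loss(t, end)

        # max() keeps the first maximum, so ties go to the smallest index
        # (an assumed tie-breaking rule).
        split = max(candidates, key=decrease)
        heapq.heappush(heap, (-decrease(split), start, split, end))

    consider(0, len(data))
    changepoints = []
    while heap and len(changepoints) < max_splits:
        # Apply the split with the largest loss decrease over all segments.
        _, start, split, end = heapq.heappop(heap)
        changepoints.append(split)
        consider(start, split)
        consider(split, end)
    return sorted(changepoints), n_candidates
```

For example, `binary_segmentation([0, 0, 0, 10, 10, 10], max_splits=1)` returns `([3], 9)`: 5 candidates on the full data, then 2 on each of the two resulting segments of size 3. The candidate count depends on how balanced the chosen splits are, which is what separates the best case from the worst case.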

Paper Structure

This paper contains 17 sections, 1 theorem, 9 equations, 4 figures, 2 tables.

Key Result

Theorem 1

The best case number of candidate splits that must be computed in binary segmentation, if a segment of $N$ data is split $K$ times into segments of min size $m$, can be determined as follows. First, for all $N$ we initialize $f(N,0) = g(N)$. Then, for all $K>0$ and for all $N$ we use the following d
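The theorem statement above is cut off before its update rule, so the sketch below does not reproduce it. Instead, it is a brute-force dynamic program for the same quantity under stated assumptions: that $g(N) = \max(0, N - 2m + 1)$ counts the valid split positions of a segment of size $N$, and that every segment created, including the final ones, has its candidates computed (consistent with $f(N,0) = g(N)$). The function name `best_case_candidates` is hypothetical.

```python
import math
from functools import lru_cache


def best_case_candidates(n_data, n_splits, min_len=1):
    """Minimum total candidate splits over all ways to split n_data points
    n_splits times into segments of size at least min_len.

    Brute-force sketch, roughly O(N^2 K^2) time; returns math.inf if the
    requested number of splits is infeasible.
    """

    def g(n):
        # Assumed definition: valid split positions in a segment of size n.
        return max(0, n - 2 * min_len + 1)

    @lru_cache(maxsize=None)
    def f(n, k):
        if n < (k + 1) * min_len:
            return math.inf  # cannot fit k + 1 segments of size min_len
        if k == 0:
            return g(n)  # initialization stated in the theorem
        best = math.inf
        # Try every size for the left child and every allocation of the
        # remaining k - 1 splits between the two children.
        for left in range(min_len, n - min_len + 1):
            for k_left in range(k):
                total = f(left, k_left) + f(n - left, k - 1 - k_left)
                best = min(best, total)
        return g(n) + best

    return f(n_data, n_splits)
```

As a small check, `best_case_candidates(4, 3)` returns 5: the first split considers 3 candidates, and a balanced split into two segments of size 2 leaves 1 candidate in each. An unbalanced first split would cost 6 instead.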

Figures (4)

  • Figure 1: Demonstration of binary segmentation algorithm on a simple synthetic data set (details in Section \ref{ties-in-different-segments}) for which a tie-breaking rule is required to achieve the best case number of splits to consider. After the first two splits, the next three splits all have an equal loss decrease value (1.33), so the best case time complexity can be achieved with a tie-breaking rule that chooses a split which results in the smallest number of candidate splits to consider afterwards.
  • Figure 2: Optimal binary trees constructed to determine optimal number of candidate splits for $N\in\{60,71,72,80\}$ data, min segment length $m=5$, and number of splits/iterations $I=9$. Smaller $N$ values result in a balanced first split, whereas larger $N$ values result in an unbalanced first split (one small child with no splits, one large child with all remaining splits).
  • Figure 3: Analysis of 2752 real genomic count data sets from the McGill benchmark, with data sizes $N$ from 87 to 263169, using binary segmentation with the Poisson loss. Number of candidate splits to consider in real data (grey dots) achieves the asymptotic best case, $O(N\log N)$.
  • Figure 4: Analysis of 13721 real genomic data sets from the neuroblastoma benchmark, with data sizes $N$ from 11 to 5937, using binary segmentation with the square loss. Number of candidate splits to consider in real data achieves the asymptotic best case, $O(N\log N)$, and in 45 instances (black circles) requires fewer candidate splits than predicted by the best case heuristic. For max segments = 10, we used dynamic programming to compute the best case number of splits for all data sizes between 11 and 100 (orange line), and the heuristic (green line) was exact for only 5 data sizes $N\in\{11,12,13,18,19\}$ (purple circles).

Theorems & Definitions (2)

  • Theorem 1
  • proof