An Enhanced Projection Pursuit Tree Classifier with Visual Methods for Assessing Algorithmic Improvements

Natalia da Silva, Dianne Cook, Eun-Kyung Lee

TL;DR

This paper extends the projection pursuit tree classifier and develops two visual diagnostic approaches for verifying, in high dimensions, that the enhancements perform as intended.

Abstract

This paper presents enhancements to the projection pursuit tree classifier and visual diagnostic methods for assessing their impact in high dimensions. The original algorithm uses linear combinations of variables in a tree structure where depth is constrained to be less than the number of classes -- a limitation that proves too rigid for complex classification problems. Our extensions improve performance in multi-class settings with unequal variance-covariance structures and nonlinear class separations by allowing more splits and more flexible class groupings in the projection pursuit computation. Proposing algorithmic improvements is straightforward; demonstrating their actual utility is not. We therefore develop two visual diagnostic approaches to verify that the enhancements perform as intended. Using high-dimensional visualization techniques, we examine model fits on benchmark datasets to assess whether the algorithm behaves as theorized. An interactive web application enables users to explore the behavior of both the original and enhanced classifiers under controlled scenarios. The enhancements are implemented in the R package PPtreeExt.

Paper Structure (12 sections, 5 equations, 17 figures)

Figures (17)

  • Figure 1: Illustration of the original PPtree algorithm for $G=3$.
  • Figure 2: Comparison of decision boundaries produced by the rpart (left) and PPtree (right) algorithms on two-dimensional simulated data. The boundaries generated by PPtree are oblique to the coordinate axes, capturing the linear association between the two variables.
  • Figure 3: Comparison of decision boundaries produced by the rpart (left) and PPtree (right) algorithms on two-dimensional simulated data. The orange class cannot be separated using a single linear partition, and PPtree fails to model it accurately because each original class must be assigned to a single terminal node.
  • Figure 4: Illustration of the algorithm with Modification 1 applied to a three-class problem. At each node, the second one-dimensional projection is computed using only the two closest groups to determine the best projection direction and split point.
  • Figure 5: Illustration of the algorithm with Modification 2 applied to a three-class problem. Projections of the data are computed at each node, and multiple splits per class are allowed. An impurity-based criterion, such as entropy, is used to determine when and how often to split observations within a node.
  • ...and 12 more figures
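
The caption of Figure 5 mentions an impurity-based criterion, such as entropy, for deciding where to split observations within a node. As a rough illustration only, the sketch below scans candidate cut points along an already-computed one-dimensional projection and picks the one with the lowest weighted entropy. The function names `entropy` and `best_split` and the toy data are hypothetical, and this is not the PPtreeExt implementation, which is an R package whose internals are not described in this section.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(projected, labels):
    """Return (cut, weighted_entropy) for the best cut on a 1-D projection.

    Candidate cuts are midpoints between consecutive distinct projected
    values. If no cut lowers the impurity of the unsplit node, cut is None.
    """
    pairs = sorted(zip(projected, labels))
    n = len(pairs)
    best_cut, best_impurity = None, entropy(labels)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut point can separate tied projected values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if impurity < best_impurity:
            best_cut, best_impurity = (lo + hi) / 2, impurity
    return best_cut, best_impurity


# Hypothetical projected values and class labels for one node.
z = [0.1, 0.4, 0.5, 2.0, 2.3, 2.9]
y = ["a", "a", "a", "b", "b", "b"]
cut, impurity = best_split(z, y)
print(cut, impurity)
```

Because the criterion depends only on the class mix on each side of the cut, the same scan could be applied again inside a child node, which is one way a class might end up in more than one terminal node.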