Ranking Perspective for Tree-based Methods with Applications to Symbolic Feature Selection

Hengrui Luo, Meng Li

TL;DR

This work provides a finite-sample analysis of tree-based methods from a ranking perspective, and proposes concordant divergence statistics $\mathcal{T}_0$ to evaluate symbolic feature mappings and establish their properties.

Abstract

Tree-based methods are powerful nonparametric techniques in statistics and machine learning. However, their effectiveness, particularly in finite-sample settings, is not fully understood. Recent applications have revealed their surprising ability to distinguish transformations (which we call symbolic feature selection) that remain obscure under current theoretical understanding. This work provides a finite-sample analysis of tree-based methods from a ranking perspective. We link oracle partitions in tree methods to response rankings at local splits, offering new insights into their finite-sample behavior in regression and feature selection tasks. Building on this local ranking perspective, we extend our analysis in two ways: (i) We examine the global ranking performance of individual trees and ensembles, including Classification and Regression Trees (CART) and Bayesian Additive Regression Trees (BART), providing finite-sample oracle bounds, ranking consistency, and posterior contraction results. (ii) Inspired by the ranking perspective, we propose concordant divergence statistics $\mathcal{T}_0$ to evaluate symbolic feature mappings and establish their properties. Numerical experiments demonstrate the competitive performance of these statistics in symbolic feature selection tasks compared to existing methods.

Paper Structure

This paper contains 24 sections, 11 theorems, 56 equations, 6 figures, and 3 tables.

Key Result

Lemma 1

(Oracle 2-partition with fixed sizes) Consider a 2-partition of $n$ elements $y_{(1)}<y_{(2)}<\cdots<y_{(n)}$ into components of sizes $i$ and $n-i$, and assume that $n>4$ and $\min(n-i,i)\geq2$ so that the variances are defined. Then the following partitions are the only 2-partitions of sizes $i$ and $n-i$ that minimize \ref{eq:loss.y.ranking}.
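
The loss \ref{eq:loss.y.ranking} is not reproduced in this summary. As a minimal sketch, assume it is the total within-group sum of squared deviations from the group means (a CART-style criterion; this choice is an assumption, not taken from the paper). Under that assumption, a brute-force search over all 2-partitions with fixed sizes can be used to check that the minimizer is contiguous in rank, i.e. the size-$i$ group consists of either the $i$ smallest or the $i$ largest order statistics. The function names and the data below are hypothetical.

```python
from itertools import combinations


def within_group_sse(values):
    """Sum of squared deviations from the group mean (assumed loss component)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_fixed_size_partition(y, i):
    """Brute-force the 2-partition of y into groups of sizes i and len(y) - i
    that minimizes the total within-group SSE.

    Returns (loss, ranks), where ranks are the 1-based ranks of the
    observations placed in the size-i group.
    """
    y_sorted = sorted(y)  # y_(1) < y_(2) < ... < y_(n), assuming distinct values
    n = len(y_sorted)
    best = None
    for group in combinations(range(n), i):
        chosen = set(group)
        first = [y_sorted[k] for k in group]
        second = [y_sorted[k] for k in range(n) if k not in chosen]
        loss = within_group_sse(first) + within_group_sse(second)
        if best is None or loss < best[0]:
            best = (loss, [k + 1 for k in group])
    return best


# Hypothetical responses with n = 7 > 4 and min(i, n - i) = 3 >= 2.
y = [2.0, -1.3, 0.4, 3.1, -0.2, 1.7, 5.0]
loss, ranks = best_fixed_size_partition(y, i=3)
print("ranks of the size-3 group:", ranks)
```

The search is exponential in $n$ and is meant only as a sanity check on small samples; the contiguity it illustrates is what links the partition to the ranking of the responses.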

Figures (6)

  • Figure 1: We illustrate 2-layer symbolic regression with $\mathcal{O}_{u}=\{id,x^{3}\}$ and $\mathcal{O}_{b}=\{+,\times\}$. We follow the notation convention $\mathcal{O}_{A_{u}}^{(2)}$ and $\mathcal{O}_{A_{b}}^{(2)}$ for the architectures specified in ye2021operator. We display all possible features in a 2-step symbolic composition as a tree structure, showing the rapidly increasing number $q$ of features, namely the transformed symbolic features $\bm{z}$.
  • Figure 2: A depth-2 tree with 5 observations showing two possible oracle partitions in Lemma \ref{lem:LemmaA}. In the first column, we present the raw $(x_{i},y_{i})$ pairs of the dataset; in the second column, we present the oracle partition using red and blue colors, and the support of the indicator functions on the $x$-axis. The horizontal solid lines represent the group means of the $y$ values (which also serve as the predicted values); the vertical dashed lines represent the point-to-mean distances. In the third column, we illustrate the loss function \ref{eq:loss.y.ranking}. The minimum in row (a) is attained by $\{y_{(4)},y_{(5)}\}=\{y_{1},y_{5}\}$ and $\{y_{(1)},y_{(2)},y_{(3)}\}=\{y_{2},y_{3},y_{4}\}$. The minimum in row (b) is attained by $\{y_{(3)},y_{(4)},y_{(5)}\}=\{y_{1},y_{2},y_{5}\}$ and $\{y_{(1)},y_{(2)}\}=\{y_{3},y_{4}\}$. We color the dots by the actual loss function values, and annotate the order statistics near each dot.
  • Figure 3: Refined monotonic intervals $\mathcal{I}_{2}=\{[0,1/2],[1/2,1]\}$ for the functions $\theta_{1}(x)=x$ and $\theta_{2}(x)=-4x^{2}+4x$ shown. We use vertical black dashed lines to mark the refined monotonic intervals, and count the number of pre-images of $\theta_{1},\theta_{2}$ over each refined interval.
  • Figure 4: Correlation between $\bm{x}$ and $\bm{y} = \theta_i(\bm{x})$ for $i = 1, \ldots, 5$. The expression and plot for each $\theta_i$ are reported in the top two rows of the table. Left to right (in the 3rd and 4th rows): Chatterjee correlation (chatterjee2021new), absolute Pearson correlation, absolute Spearman correlation, absolute Kendall correlation, and $\log(\mathcal{T}_{0})$. $\mathcal{T}_{0}$ is shown on a log scale for better comparison. We generate an equally spaced $\bm{x}$ on $[-1,1]$ with sample size $N=50$ (3rd row) and $N=500$ (4th row). Gaussian noise with variance $\sigma^2$ is added to $\theta_i(\bm{x})$.
  • Figure 5: We illustrate the PR curves from 50 repeats ($n=100$) of a 2-layer symbolic regression with $\mathcal{O}_{u}=\{id,x^{3}\}$ and $\mathcal{O}_{b}=\{+,\times\}$. The true signal is \ref{eq:3_var_true_signal} with no noise. The first row corresponds to the architecture $\mathcal{O}_{A_{u}}^{(2)}$ and the second row to the architecture $\mathcal{O}_{A_{b}}^{(2)}$. We provide boxplots of the AUC values across the 50 repeats.
  • ...and 1 more figures
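
The baseline measures compared in Figure 4 can be sketched in a few lines of standard-library Python. The sketch below assumes no ties, uses a hypothetical test function $\theta(x)=x^{2}$ and a hypothetical noise level $\sigma=0.01$ (the five $\theta_i$ and the $\sigma$ used in the paper are not listed in this summary), and omits $\mathcal{T}_{0}$ because its definition is not given here. The Chatterjee coefficient follows the standard no-ties formula $\xi_n = 1 - 3\sum_{k}|r_{k+1}-r_{k}|/(n^{2}-1)$, where $r_k$ is the rank of $y$ after sorting the pairs by $x$.

```python
import math
import random


def ranks(values):
    """1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0] * len(values)
    for pos, k in enumerate(order):
        r[k] = pos + 1
    return r


def pearson(a, b):
    """Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def kendall(a, b):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(a)
    s = 0
    for p in range(n):
        for q in range(p + 1, n):
            prod = (a[p] - a[q]) * (b[p] - b[q])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def chatterjee_xi(x, y):
    """Chatterjee's xi coefficient (no-ties formula)."""
    n = len(x)
    order = sorted(range(n), key=lambda k: x[k])
    r = ranks(y)
    r_by_x = [r[k] for k in order]
    total = sum(abs(r_by_x[k + 1] - r_by_x[k]) for k in range(n - 1))
    return 1 - 3 * total / (n * n - 1)


# Equally spaced x on [-1, 1] with N = 50, as in the caption.
N = 50
x = [-1 + 2 * k / (N - 1) for k in range(N)]
rng = random.Random(0)
sigma = 0.01  # hypothetical noise level
y = [v ** 2 + rng.gauss(0, sigma) for v in x]  # hypothetical theta(x) = x^2

print("Chatterjee xi :", chatterjee_xi(x, y))
print("|Pearson|     :", abs(pearson(x, y)))
print("|Spearman|    :", abs(spearman(x, y)))
print("|Kendall|     :", abs(kendall(x, y)))
```

For a non-monotone signal such as this one, the three classical coefficients are expected to be near zero while $\xi_n$ stays well away from zero, which is the kind of contrast a comparison like Figure 4 is designed to expose.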

Theorems & Definitions (26)

  • Example 1
  • Example 2
  • Lemma 1
  • Remark 2
  • Example 3
  • Remark 3
  • Corollary 4
  • Example 4
  • Example 5
  • Definition 5
  • ...and 16 more