
Robust Twoblock Simultaneous Dimension Reduction

Sven Serneels

Abstract

This paper introduces robust twoblock (RTB) simultaneous dimension reduction, the first statistically robust method to perform simultaneous dimension reduction in two blocks of variables while allowing the model complexity in each block to be fine-tuned individually. The paper proposes both a dense and a sparse version of the new method. Sparse RTB is the first robust estimator that allows both the model complexity and the degree of sparsity to be selected for each block individually. RTB thereby optimally extracts and summarizes the relevant portion of information in each block of data, also in the presence of outliers. As a corollary, the estimators can be recombined into a single estimate of regression coefficients for multivariate regression that is operable when the number of variables exceeds the number of cases in each block. An extensive simulation study illustrates that the new methods are resistant to different types of outliers, while maintaining estimation efficiency, across a range of dimensionality settings. These findings hold true for both the dense and the sparse method. The methods' performance is further illustrated on two example data sets, and a straightforward algorithm is presented and made accessible in an open source repository.
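To fix ideas, the following is a minimal conceptual sketch of (non-robust) twoblock dimension reduction and of recombining the block-wise estimates into multivariate regression coefficients. It is not the RTB algorithm itself: it assumes a standard PLS-style reduction via the SVD of the cross-product matrix between the two centered blocks, with hypothetical variable names throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, k = 50, 10, 4, 2  # cases, X-variables, Y-variables, latent dimensions

X = rng.standard_normal((n, p))
Y = X[:, :k] @ rng.standard_normal((k, q)) + 0.1 * rng.standard_normal((n, q))

# Center each block of variables
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

# Simultaneous dimension reduction: direction vectors for both blocks
# from the SVD of the cross-product matrix (PLS-style, assumed here)
U_, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
W = U_[:, :k]   # X-block direction vectors (p x k)
V = Vt[:k, :].T  # Y-block direction vectors (q x k)

# Scores: low-dimensional summaries of each block
T = Xc @ W  # (n x k)
S = Yc @ V  # (n x k)

# Recombine into a single multivariate regression coefficient matrix:
# regress Yc on the X-scores, then map back to the original variables.
# Because T has only k columns, this works even when p > n.
theta = np.linalg.pinv(T) @ Yc  # (k x q)
B = W @ theta                   # (p x q)
```

A robust variant would downweight outlying cases when forming the cross-product matrix and the regression step; the recombination logic stays the same.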

Paper Structure

This paper contains 28 sections, 5 equations, 9 figures, 2 tables, 3 algorithms.

Figures (9)

  • Figure 1: MSE of regression coefficient estimates across 42 simulation scenarios (200 repeats each). Dark blue: TB dense; dark red: RTB dense; light blue: TB sparse; salmon: RTB sparse. Top row: $p \leq n$; bottom row: $p > n$. Within each panel, the x-axis shows the contamination proportion and type. Error bars indicate one standard deviation.
  • Figure 2: MSE ratio of RTB to twoblock for dense (left) and sparse (right) variants across all simulation scenarios. Green cells ($< 1$) indicate RTB outperforms twoblock; red cells ($> 1$) the reverse. The only red cells correspond to the uncontaminated baseline.
  • Figure 3: Variable selection $F_1$ score for sparse twoblock (light blue) and sparse RTB (salmon) across contamination scenarios, for configurations with noise variables ($\eta_x = 0.5$). Higher is better.
  • Figure 4: RTB dense case weights for the cookie dough training data. Case 23 (red) and case 24 are assigned near-zero weights.
  • Figure 5: RTB sparse case weights for the cookie dough training data. Cases 23 and 24 (red: case 23) are again assigned near-zero weights.
  • ...and 4 more figures