An index of effective number of variables for uncertainty and reliability analysis in model selection problems

Luca Martino, Eduardo Morgado, Roberto San Millán-Castillo

Abstract

An index of an effective number of variables (ENV) is introduced for model selection in nested models. This is the case, for instance, when we have to decide the order of a polynomial function or the number of bases in a nonlinear regression, choose the number of clusters in a clustering problem, or the number of features in a variable selection application (to name a few examples). It is inspired by the idea of the maximum area under the curve (AUC). The interpretation of the ENV index is identical to that of the effective sample size (ESS) indices for a set of samples. The ENV index addresses drawbacks of the elbow detectors described in the literature and introduces different confidence measures of the proposed solution. These novel measures can also be employed jointly with different information criteria, such as the well-known AIC and BIC, or with any other model selection procedure. Comparisons with classical and recent schemes are provided in different experiments involving real datasets. Related Matlab code is given.

Paper Structure

This paper contains 17 sections, 24 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Classes of methods for model selection (standard ones and more recent approaches).
  • Figure 2: (a) Example of error function $V(k)$ where $K=6$, (b) Construction with two straight lines and the areas $A_1$, $A_2$, and $A_3$.
  • Figure 3: (a) We can consider that $V(k)$ is like a sampled curve obtained from sampling - in a signal processing sense - a continuous function $V(x)$ where $x\in \mathbb{R}$ (shown in dashed line) is an auxiliary continuous variable. The continuous function $V(x)$ possibly does not exist and can be just a theoretical tool to define the area $A_V$. (b) In any case, we have access to $V(k)$, $k\in \mathbb{N}$, which allows us to obtain the approximation $\widehat{A} \approx A_V$.
  • Figure 4: Special ideal cases. (a) $V(k)$ reaches zero already at $k=1$. Only the first component is relevant, hence the optimal choice $k^*=k_e=1$. (b) $V(k)$ is a straight line connecting the points $(0,V(0))$ and $(K,0)$. All variables contribute in the same way to the decay of $V(k)$, hence the optimal choice $k^*=k_e=K$ (in the figure, $K=6$). (c) The function $V(k)$ is a straight line passing through the points $(0,V(0))$ and $(k^*, 0)$, that is, $V(k)=V(0)-\frac{V(0)}{k^*}k$, so that $V(k^*)=0$ at some point $k^*<K$. Clearly, the point $k^*=k_e=3$ is an optimal choice: the first 3 variables have the same contribution to the decay of $V(k)$ and completely explain the drop.
  • Figure 5: Four ideal curves $V(k)$ (blue solid lines) in which an "elbow" is well-defined at $k_e=14$. This is exactly the result obtained by applying an elbow detector (green circles), whereas the results provided by the ENV index are shown by red triangles. Note that, in all cases, $I_\texttt{ENV}> k_e$.
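
The captions of Figures 3 and 4 describe approximating the area under $V(k)$ from its samples and three ideal curves whose optimal $k^*$ is known. The sketch below builds those three curves and checks a simple area-based quantity against them. It makes two assumptions that are not stated in this section: the area $\widehat{A}$ is computed with the trapezoidal rule on unit spacing, and the quantity $2\widehat{A}/V(0)$ (the hypothetical function `normalized_area_index`) stands in for an area-based index. It should not be read as the paper's definition of the ENV index.

```python
def trapezoid_area(v):
    """Trapezoidal approximation of the area under V(k), k = 0..K, unit spacing.

    Assumption: the trapezoidal rule is used here for illustration only.
    """
    return sum((a + b) / 2.0 for a, b in zip(v[:-1], v[1:]))


def normalized_area_index(v):
    """Hypothetical index 2 * A_hat / V(0); not taken from the source text."""
    return 2.0 * trapezoid_area(v) / v[0]


K = 6        # number of variables, as in Figure 4
V0 = 1.0     # arbitrary positive value for V(0)
K_STAR = 3   # zero-crossing of the line in case (c)

# (a) V(k) reaches zero already at k = 1.
case_a = [V0] + [0.0] * K

# (b) Straight line connecting (0, V(0)) and (K, 0).
case_b = [V0 - V0 * k / K for k in range(K + 1)]

# (c) V(k) = V(0) - (V(0) / k*) k up to k*, then zero.
case_c = [max(V0 - V0 * k / K_STAR, 0.0) for k in range(K + 1)]

for name, curve in (("a", case_a), ("b", case_b), ("c", case_c)):
    print(name, normalized_area_index(curve))
```

Under these assumptions the quantity equals, up to floating-point rounding, 1, $K$, and $k^*$ for cases (a), (b), and (c), which matches the optimal choices stated in the Figure 4 caption.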