An index of effective number of variables for uncertainty and reliability analysis in model selection problems

Luca Martino; Eduardo Morgado; Roberto San Millán-Castillo

Abstract

An index of the effective number of variables (ENV) is introduced for model selection in nested models. This is the case, for instance, when we have to decide the order of a polynomial function or the number of bases in a nonlinear regression, or choose the number of clusters in a clustering problem or the number of features in a variable selection application (to name a few examples). The index is inspired by the idea of the maximum area under the curve (AUC). The ENV index is interpreted in the same way as the effective sample size (ESS) indices defined for a set of samples. The ENV index addresses drawbacks of the elbow detectors described in the literature and introduces different confidence measures of the proposed solution. These novel measures can also be employed jointly with different information criteria, such as the well-known AIC and BIC, or with any other model selection procedure. Comparisons with classical and recent schemes are provided in different experiments involving real datasets. Related Matlab code is given.

Paper Structure (17 sections, 24 equations, 5 figures, 4 tables)

Introduction
Framework and main notation
The error curve $V(k)$ as a figure of merit
The universal automatic elbow detector
An index of the effective number of variables (ENV)
Derivation of the ENV index
Behavior of $I_\texttt{ENV}$ in ideal cases
Interpreting and using the ENV index
First considerations
Confidence measures for the decision
Numerical experiments
Synthetic experiment where $V(k)$ is an analytic function
Variable selection in a regression problem with real data
Variable selection in a biomedical classification problem with real data
Conclusions
...and 2 more sections

Figures (5)

Figure 1: Classes of methods for model selection (standard ones and more recent approaches).
Figure 2: (a) Example of error function $V(k)$ where $K=6$, (b) Construction with two straight lines and the areas $A_1$, $A_2$, and $A_3$.
Figure 3: (a) We can regard $V(k)$ as a sampled curve obtained by sampling, in a signal processing sense, a continuous function $V(x)$ (shown as a dashed line), where $x\in \mathbb{R}$ is an auxiliary continuous variable. The continuous function $V(x)$ may not exist and can be just a theoretical tool to define the area $A_V$. (b) In any case, we have access to $V(k)$, $k\in \mathbb{N}$, which allows us to obtain the approximation $\widehat{A} \approx A_V$.
Figure 4: Special ideal cases. (a) $V(k)$ reaches zero already at $k=1$. Only the first component is relevant, hence the optimal choice is $k^*=k_e=1$. (b) $V(k)$ is a straight line connecting the points $(0,V(0))$ and $(K,0)$. All variables contribute in the same way to the decay of $V(k)$, hence the optimal choice is $k^*=k_e=K$ (in the figure, $K=6$). (c) The function $V(k)$ is a straight line passing through the points $(0,V(0))$ and $(k^*, 0)$, that is, $V(k)=V(0)-\frac{V(0)}{k^*}k$, so that $V(k^*)=0$ at some point $k^*<K$. Clearly, the point $k^*=k_e=3$ is an optimal choice: the first 3 variables have the same contribution to the decay of $V(k)$ and completely explain the drop.
Figure 5: Ideal cases of four curves $V(k)$ (blue solid lines) where an "elbow" is well-defined at $k_e=14$. This is exactly the result obtained by applying an elbow detector (green circles), whereas the results provided by the ENV index are shown by red triangles. Note that, in all cases, $I_\texttt{ENV}> k_e$.
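The area approximation $\widehat{A} \approx A_V$ of Figure 3 and the ideal cases of Figure 4 can be checked numerically with a minimal Python sketch. It rests on assumptions that are not stated in the captions: the area is approximated with the trapezoidal rule on the samples $V(0), \ldots, V(K)$, and the helper names `trapezoid_area`, `linear_decay`, and `normalized_area` are hypothetical. In particular, the normalization $2\widehat{A}/V(0)$ is used here only because it maps the triangular areas of the ideal cases back to their base lengths; it should not be read as the paper's definition of $I_\texttt{ENV}$.

```python
def trapezoid_area(v):
    """Trapezoidal area under samples v[0], ..., v[K] taken at k = 0, 1, ..., K.

    Assumption: unit spacing between consecutive k and the trapezoidal rule.
    """
    return sum((a + b) / 2.0 for a, b in zip(v[:-1], v[1:]))


def linear_decay(v0, k_star, K):
    """Samples of V(k) = V(0) - (V(0) / k_star) * k, held at zero for k > k_star."""
    return [max(v0 * (1.0 - k / k_star), 0.0) for k in range(K + 1)]


def normalized_area(v):
    """Hypothetical normalization 2 * area / V(0), used for illustration only.

    For a triangle with height V(0) and base b, this returns b.
    """
    return 2.0 * trapezoid_area(v) / v[0]


K = 6
V0 = 10.0  # arbitrary illustrative value of V(0)

case_a = linear_decay(V0, 1, K)  # Figure 4(a): V(k) reaches zero at k = 1
case_b = linear_decay(V0, K, K)  # Figure 4(b): straight line down to (K, 0)
case_c = linear_decay(V0, 3, K)  # Figure 4(c): straight line down to (3, 0)

results = {
    "a": normalized_area(case_a),
    "b": normalized_area(case_b),
    "c": normalized_area(case_c),
}
```

Under these assumptions, the three normalized areas come out (up to floating-point rounding) as 1, 6, and 3, which agree with the optimal choices $k^*=1$, $k^*=K=6$, and $k^*=3$ stated in the caption of Figure 4.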
