Statistical Hypothesis Testing for Information Value (IV)

Helder Rojas; Cirilo Alvarez; Nilton Rojas

TL;DR

This work addresses the lack of statistical justification for fixed IV thresholds by linking Information Value to the Jeffreys divergence and introducing a nonparametric J-Divergence test with asymptotic guarantees. The authors establish almost-sure consistency and asymptotic normality for the IV estimator, derive a normal-based test statistic, and validate performance via simulations and a real fraud-detection dataset. The approach yields a robust, interpretable pre-modeling filter that outperforms traditional IV thresholding in imbalanced settings, and it is complemented by an open-source Python library for practical adoption. Overall, the J-Divergence test offers a principled, model-agnostic alternative for feature selection prior to modeling, with clear pathways for extension to multinomial settings and integrated pipelines.

Abstract

Information Value (IV) is a widely used technique for feature selection prior to the modeling phase, particularly in credit scoring and related domains. However, conventional IV-based practices rely on fixed empirical thresholds, which lack statistical justification and may be sensitive to characteristics such as class imbalance. In this work, we develop a formal statistical framework for IV by establishing its connection with Jeffreys divergence and propose a novel nonparametric hypothesis test, referred to as the J-Divergence test. Our method provides rigorous asymptotic guarantees and enables interpretable decisions based on $p$-values. Numerical experiments, including synthetic and real-world data, demonstrate that the proposed test is more reliable than traditional IV thresholding, particularly under strong imbalance. The test is model-agnostic, computationally efficient, and well-suited for the pre-modeling phase in high-dimensional or imbalanced settings. An open-source Python library is provided for reproducibility and practical adoption.
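The connection between IV and the Jeffreys divergence mentioned above can be checked numerically. The sketch below is illustrative only: it is not the authors' implementation and does not use their library. It computes IV from per-bin class counts using the standard formula $\sum_i (p_i - q_i)\ln(p_i/q_i)$ and compares it with the symmetrized Kullback-Leibler divergence $\mathrm{KL}(p\,\|\,q) + \mathrm{KL}(q\,\|\,p)$. The function names and the example counts are assumptions made for illustration, and every bin is assumed to contain at least one observation of each class. The test statistic of the J-Divergence test itself is not reproduced here.

```python
import numpy as np


def bin_proportions(counts):
    """Turn per-bin counts into a discrete distribution (assumes all counts > 0)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()


def information_value(counts_a, counts_b):
    """Standard IV: sum over bins of (p_i - q_i) * ln(p_i / q_i)."""
    p = bin_proportions(counts_a)
    q = bin_proportions(counts_b)
    return float(np.sum((p - q) * np.log(p / q)))


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))


def jeffreys_divergence(counts_a, counts_b):
    """Jeffreys divergence: KL(p || q) + KL(q || p)."""
    p = bin_proportions(counts_a)
    q = bin_proportions(counts_b)
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical binned counts for one feature, split by class (illustrative data).
class_a_counts = [400, 300, 200, 100]
class_b_counts = [10, 20, 30, 40]

iv = information_value(class_a_counts, class_b_counts)
jd = jeffreys_divergence(class_a_counts, class_b_counts)
print(f"IV = {iv:.6f}, Jeffreys divergence = {jd:.6f}")
```

The two quantities agree up to floating-point error, which is why a hypothesis test on the Jeffreys divergence can be read as a test on IV. Note that the class proportions, not the raw class sizes, enter the formula.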

Paper Structure (8 sections, 41 equations, 6 figures, 5 tables)

Introduction
Our critique and motivation
Main contributions
Definition of the hypotheses test
Performance of the test on simulated data
Features selection in fraud detection
Implementation
Conclusions and future work

Figures (6)

Figure 1: Comparison between the power function of the J-Divergence test (top line) and that of the empirical criterion IV $> 0.1$ (bottom line). Both lines are solid and can be distinguished by their relative positions, which keeps the plot readable when printed in black and white. Simulation parameters: $N = 50\,300$, $n = 300$, $m = 50\,000$, Imbalance-Rate = 0.00596, $\alpha = 0.1\%$, $r = 10$.
Figure 2: Power function of the J-Divergence test for different imbalance rates. The lines represent different imbalance rates, with the line at the top of the plot corresponding to an imbalance rate of 0.38, the next line representing 0.5, followed by 0.6, 0.67, 0.75, 0.86, and the bottom line representing 0.97. This description ensures clarity in black and white prints, as the lines can be distinguished based on their relative positions. Simulation parameters: $r = 10$, $\alpha = 0.1\%$, $n = 3\,000$, $m \in [100, 5\,000]$.
Figure 3: Power function of the criterion IV $> 0.1$ for different imbalance rates. The lines represent different imbalance rates, with the topmost line corresponding to an imbalance rate of 0.38, followed by 0.5, 0.6, 0.67, 0.75, 0.86, and the bottommost line representing 0.97. This ordering ensures clarity when printed in black and white, as the lines can be distinguished based on their relative positions. Simulation parameters: $r = 10$, $\alpha = 0.1\%$, $n = 3\,000$, $m \in [100, 5\,000]$.
Figure 4: Power function of the J-Divergence test for different values of $\alpha$. The lines represent different values of $\alpha$, with the line at the top of the plot corresponding to $\alpha = 10^{-8}$, followed by $\alpha = 10^{-7}$, $\alpha = 10^{-6}$, $\alpha = 10^{-5}$, $\alpha = 10^{-4}$, $\alpha = 0.001$, $\alpha = 0.002$, $\alpha = 0.005$, and the line at the bottom representing $\alpha = 0.05$. This ordering ensures clarity when printed in black and white, as the lines can be distinguished based on their relative positions. Simulation parameters: $n = m \in [300, 2\,500]$, Imbalance-Rate = 0.5, $r = 10$.
Figure 5: Power function of the J-Divergence test for different numbers of bins. The lines represent different numbers of bins, with the line at the top corresponding to 20 bins, followed by 16, 14, 12, 10, 8, 6, 4, and the bottommost line representing 2 bins. This ordering ensures clarity when printed in black and white, as the lines can be distinguished based on their relative positions. Simulation parameters: $n = m = 3\,000$, Imbalance-Rate = 0.5, $r \in [2, 20]$.
...and 1 more figures

Theorems & Definitions (5)

Remark 1
Remark 2
Remark 3
Remark 4
Remark 5
