Loss Given Default Prediction Under Measurement-Induced Mixture Distributions: An Information-Theoretic Approach

Javier Marín

TL;DR

This work analyzes loss given default (LGD) modeling under a severe measurement-induced mixture, in which proxy estimates dominate the training data ($\pi_{\text{proxy}}=0.897$). It shows that recursive partitioning methods such as Random Forest systematically misfit because they are biased toward the proxy distribution, yielding a negative $R^2$ on held-out data. The authors develop an information-theoretic framework that weights features by mutual information and accounts for uncertainty via entropy, attaining an RMSE of $0.284$ and $R^2=0.191$ on 1,218 bankruptcies, against a theoretical ceiling near $R^2_{\max}\approx0.40$. These results provide practical guidance for deploying LGD models under Basel III when representative outcome data are scarce, and the approach generalizes to other domains where long observation periods induce mixture structure in training data.

Abstract

Loss Given Default (LGD) modeling faces a fundamental data quality constraint: about 90% of available training data consists of proxy estimates based on pre-distress balance sheets rather than actual recovery outcomes from completed bankruptcy proceedings. We demonstrate that this mixture-contaminated training structure causes systematic failure of recursive partitioning methods, with Random Forest achieving a negative $R^2$ ($-0.664$, worse than predicting the mean) on held-out test data. Information-theoretic approaches based on Shannon entropy and mutual information generalize better, achieving an $R^2$ of 0.191 and an RMSE of 0.284 on 1,218 corporate bankruptcies (1980-2023). Analysis reveals that leverage-based features contain 1.510 bits of mutual information while size effects contribute only 0.086 bits, contradicting regulatory assumptions about scale-dependent recovery. These results establish practical guidance for financial institutions deploying LGD models under Basel III requirements when representative outcome data is unavailable at sufficient scale. The findings generalize to medical outcomes research, climate forecasting, and technology reliability: domains where extended observation periods create unavoidable mixture structure in training data.
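
To make the "worse than predicting the mean" reading of a negative $R^2$ concrete, the following is a minimal sketch using the standard definition $R^2 = 1 - SS_{\text{res}}/SS_{\text{tot}}$. The LGD values and the two predictors are made-up toy numbers chosen for illustration, not data or models from the paper; the function name `r_squared` is likewise hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the sample mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy held-out LGD values (illustrative only, not from the paper).
y_true = [0.2, 0.4, 0.6, 0.8]

# Baseline: always predict the held-out mean, which gives R^2 of zero.
mean_baseline = [0.5] * len(y_true)

# A predictor pulled toward a different (e.g. proxy) distribution:
# its squared error exceeds that of the mean baseline, so R^2 is negative.
biased_predictions = [0.9] * len(y_true)

r2_baseline = r_squared(y_true, mean_baseline)
r2_biased = r_squared(y_true, biased_predictions)
print(r2_baseline, r2_biased)
```

Under this definition, any model whose residual sum of squares exceeds the total sum of squares around the mean scores below zero, which is the situation the abstract reports for Random Forest on held-out data.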

Paper Structure

This paper contains 27 sections, 17 equations, 7 tables.