Time Series Feature Redundancy Paradox: An Empirical Study Based on Mortgage Default Prediction
Chengyue Huang, Yahe Yang
TL;DR
The paper tests whether longer training histories and larger feature sets improve mortgage default prediction, uncovering a feature redundancy paradox in time-series data. By analyzing Freddie Mac data from 2012–2022 across 1-, 3-, 5-, and 10-year windows and varying feature sets, it shows that shorter, more recent data combined with a small, carefully selected feature subset yields superior ROC-AUC performance, especially for Transformer models. The findings indicate that longer historical data can introduce temporal noise and outdated patterns, while excessive features can obscure core default indicators. The work offers actionable guidance on time-window selection and feature parsimony, with implications for dynamic retraining and risk forecasting in financial contexts.
Abstract
With the widespread application of machine learning in financial risk management, conventional wisdom holds that longer training periods and more feature variables improve model performance. Focusing on mortgage default prediction, this paper presents empirical evidence that contradicts this view: in time series prediction, a longer training timespan and additional non-critical features lead to significant deterioration in predictive performance. Using Fannie Mae's mortgage data from 2012-2022, the study compares predictive performance across different time window lengths and feature combinations, and finds that shorter time windows (such as single-year periods) paired with carefully selected key features yield superior prediction results. The experimental results indicate that extended time spans may introduce noise from historical data and outdated market patterns, while excessive non-critical features interfere with the model's learning of core default factors. This research challenges the traditional "more is better" approach to data modeling and provides new insights and practical guidance for feature selection and time window optimization in financial risk prediction.
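The comparison described in the abstract can be pictured as a grid over training-window lengths and feature subsets, each scored by ROC-AUC. The following is a minimal sketch of that setup, not the authors' code. It assumes windows that end in a common last training year, loan records stored as dictionaries with hypothetical keys `year` and `default`, and illustrative feature names such as `fico` and `ltv`; the helper names are also hypothetical.

```python
from itertools import product


def training_windows(last_train_year, lengths=(1, 3, 5, 10)):
    """Map each window length to (first_year, last_year), all ending in the same year.

    The default lengths are an assumption chosen for illustration.
    """
    return {n: (last_train_year - n + 1, last_train_year) for n in lengths}


def select_rows(rows, window, features):
    """Keep rows whose year falls in the window, restricted to the chosen features.

    'year' and 'default' are assumed column names, not taken from the dataset.
    """
    first, last = window
    return [
        {key: row[key] for key in (*features, "default")}
        for row in rows
        if first <= row["year"] <= last
    ]


def roc_auc(labels, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy records with made-up values, only to show the mechanics.
    loans = [
        {"year": 2012, "fico": 640, "ltv": 95, "default": 1},
        {"year": 2019, "fico": 720, "ltv": 80, "default": 0},
        {"year": 2021, "fico": 610, "ltv": 97, "default": 1},
        {"year": 2021, "fico": 780, "ltv": 60, "default": 0},
    ]
    for length, window in training_windows(2021).items():
        subset = select_rows(loans, window, ["fico"])
        print(length, window, len(subset))
```

In a full experiment, each (window, feature subset) pair would be used to fit a model, and `roc_auc` would be computed on a held-out period that is the same for every configuration, so that differences in the score reflect only the window and feature choices.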
