Multi-Modal Deep Learning for Credit Rating Prediction Using Text and Numerical Data Streams
Mahsa Tavakoli, Rohitash Chandra, Fengrui Tian, Cristián Bravo
TL;DR
This paper tackles credit rating prediction by integrating structured numerical data with unstructured earnings call transcripts through multimodal deep learning. It systematically compares fusion strategies by level (early vs. intermediate) and by technique (concatenation vs. cross-attention) across CNN, ConvLSTM, ConvGRU, CNN-Attn, and BERT models, exploring four structural configurations and quantifying each modality's contribution. A CNN-based model with Hybrid Concatenation and early-intermediate fusion often yields the best performance; the text channel provides the strongest predictive signal, and cross-attention further enhances multimodal integration. The study also demonstrates robustness under out-of-time and out-of-universe conditions, examines the impact of COVID-19 on performance, and finds that Moody's ratings outperform those of Standard & Poor's and Fitch Ratings across short-, medium-, and long-term horizons, underscoring the practical relevance for rating agencies and financial institutions.
Abstract
Knowing which factors are significant in credit rating assignment leads to better decision-making. However, the literature thus far has focused mostly on structured data, and fewer studies have addressed unstructured or multi-modal datasets. In this paper, we analyze the most effective architectures for fusing deep learning models to predict company credit rating classes, using structured and unstructured datasets of different types. We tested different combinations of fusion strategies with different deep learning models, including CNN, LSTM, GRU, and BERT. We studied data fusion strategies in terms of level (early and intermediate fusion) and technique (concatenation and cross-attention). Our results show that a CNN-based multi-modal model with two fusion strategies outperformed other multi-modal techniques. In addition, by comparing simple architectures with more complex ones, we found that more sophisticated deep learning models do not necessarily produce the highest performance; however, when attention-based models produce the best results, cross-attention is necessary as a fusion strategy. Finally, our comparison of rating agencies on short-, medium-, and long-term performance shows that Moody's credit ratings outperform those of other agencies such as Standard & Poor's and Fitch Ratings.
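
To make the two fusion techniques named in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the text channel has already been encoded as a short list of token vectors and the numerical channel as a single vector of the same width; the function names (`concat_fusion`, `cross_attention_fusion`) and the toy values are hypothetical. Concatenation simply joins feature vectors, while cross-attention lets one modality (here, the numerical vector as the query) weight the elements of the other (the text tokens as keys and values) using standard scaled dot-product attention without learned projections.

```python
import math


def concat_fusion(text_vec, num_vec):
    """Concatenation fusion: join two feature vectors end to end."""
    return list(text_vec) + list(num_vec)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fusion(queries, keys, values):
    """Scaled dot-product cross-attention (no learned projections).

    Each query row from one modality attends over the key/value rows
    of the other modality and returns a weighted average of the values.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        fused.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return fused


# Toy inputs (assumed, for illustration only).
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three encoded text tokens
num_token = [[0.5, 0.5]]                            # one encoded numerical vector

# Concatenation: pool the text tokens by averaging, then join with the numbers.
text_pooled = [sum(col) / len(text_tokens) for col in zip(*text_tokens)]
fused_concat = concat_fusion(text_pooled, num_token[0])

# Cross-attention: the numerical vector queries the text tokens.
fused_attn = cross_attention_fusion(num_token, text_tokens, text_tokens)
```

In this sketch the concatenated representation has width 4 (two pooled text features plus two numerical features), whereas the cross-attention output keeps the width of the text tokens and depends on how strongly the numerical query aligns with each token. Whether fusion is applied to raw inputs (early) or to learned representations (intermediate) is a separate choice about where in the network such a function is called.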
