
Multi-Modal Deep Learning for Credit Rating Prediction Using Text and Numerical Data Streams

Mahsa Tavakoli, Rohitash Chandra, Fengrui Tian, Cristián Bravo

TL;DR

This paper tackles credit rating prediction by integrating structured numerical data with unstructured earnings call transcripts through multimodal deep learning. It systematically compares fusion strategies (early vs. intermediate level; concatenation vs. cross-attention) across CNN, ConvLSTM, ConvGRU, CNN-Attn, and BERT, exploring four structural configurations and quantifying each modality's contribution. Key findings show that a CNN-based model with Hybrid Concatenation and early-intermediate fusion often yields the best performance, with the text channel providing the strongest predictive signal and cross-attention further enhancing multimodal integration. The study also demonstrates robustness under out-of-time and out-of-universe conditions, examines the impact of COVID-19 on performance, and finds that Moody's ratings outperform those of Standard & Poor's and Fitch Ratings across short-, medium-, and long-term horizons, underscoring the practical relevance for rating agencies and financial institutions.

Abstract

Knowing which factors are significant in credit rating assignment leads to better decision-making. However, the focus of the literature thus far has been mostly on structured data, and fewer studies have addressed unstructured or multi-modal datasets. In this paper, we present an analysis of the most effective architectures for the fusion of deep learning models for the prediction of company credit rating classes, by using structured and unstructured datasets of different types. In these models, we tested different combinations of fusion strategies with different deep learning models, including CNN, LSTM, GRU, and BERT. We studied data fusion strategies in terms of level (including early and intermediate fusion) and techniques (including concatenation and cross-attention). Our results show that a CNN-based multi-modal model with two fusion strategies outperformed other multi-modal techniques. In addition, by comparing simple architectures with more complex ones, we found that more sophisticated deep learning models do not necessarily produce the highest performance; however, if attention-based models are producing the best results, cross-attention is necessary as a fusion strategy. Finally, our comparison of rating agencies on short-, medium-, and long-term performance shows that Moody's credit ratings outperform those of other agencies like Standard & Poor's and Fitch Ratings.
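
To make the distinction between fusion levels concrete, the following is a minimal sketch of concatenation-based fusion at the early level, the intermediate level, and both together. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`dense_relu`, `const_weights`), the constant weights, the toy feature vectors, and the way the hybrid variant joins the two branches are all hypothetical.

```python
def dense_relu(x, weights):
    """Toy stand-in for a learned submodel: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def const_weights(rows, cols, value=0.1):
    """Hypothetical fixed weights; a real model would learn these."""
    return [[value] * cols for _ in range(rows)]


# Hypothetical inputs for one company (values are made up for illustration).
text_features = [0.2, 0.7, 0.1, 0.9]   # e.g. an embedded earnings call transcript
numeric_features = [1.5, -0.3]         # e.g. structured financial variables

# Early fusion: concatenate the raw inputs, then apply one shared encoder.
early_input = text_features + numeric_features
early_hidden = dense_relu(early_input, const_weights(3, len(early_input)))

# Intermediate fusion: encode each modality separately,
# then concatenate the hidden representations.
text_hidden = dense_relu(text_features, const_weights(3, len(text_features)))
numeric_hidden = dense_relu(numeric_features, const_weights(2, len(numeric_features)))
intermediate_hidden = text_hidden + numeric_hidden

# Hybrid (assumed reading): keep both the early and the intermediate
# representation and pass their concatenation to the classifier head.
hybrid_hidden = early_hidden + intermediate_hidden

print(len(early_hidden), len(intermediate_hidden), len(hybrid_hidden))
```

In this sketch the only difference between the strategies is where the list concatenation happens relative to the encoders, which is the design choice the paper compares.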

Paper Structure

This paper contains 29 sections, 1 equation, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: Frequency of ratings in the original dataset, where some classes have a disproportionately low number of ratings, suggesting that they may not contain sufficient information to support accurate classification.
  • Figure 2: Average number of words per earnings call transcript across different rating classes.
  • Figure 3: This diagram illustrates various fusion strategies at different levels: Early, Intermediate, and Late Fusion. The Simple Fusion strategy is depicted in the first three sections, where only two modalities are involved, and the fusion occurs at a single level—either early, intermediate, or late. In the Hybrid Fusion strategy (last section), multiple modalities are combined at multiple levels (both early and intermediate), leveraging the strengths of each approach.
  • Figure 4: The diagram illustrates Self-Attention (top), where elements of a sequence attend to each other, and Cross-Attention (bottom), where one sequence (the query) attends to another sequence (the keys and values). The three vectors used in the computation are the query (Q), key (K), and value (V). Self-Attention captures internal dependencies within a single sequence, while Cross-Attention links information across different sequences, which is crucial for tasks such as multi-modal learning and sequence-to-sequence modeling.
  • Figure 5: The four multimodal frameworks for credit rating prediction, which combine deep learning-based submodels (A and B) with different fusion types and fusion levels. The submodel architectures and implementation are presented in the paper's submodel-combination table (label: tab:combination).
  • ...and 7 more figures
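
The difference between self-attention and cross-attention described in the Figure 4 caption can be sketched with plain scaled dot-product attention. This is a simplified illustration, not the paper's model: it omits the learned Q/K/V projection matrices, and the function names and toy token vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors.

    queries: n_q vectors of size d; keys: n_k vectors of size d;
    values: n_k vectors of size d_v. Returns (outputs, weights).
    """
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# Hypothetical toy sequences: 3 text tokens and 2 numerical-feature tokens.
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
numeric_tokens = [[0.5, 0.5], [1.0, -1.0]]

# Self-attention: Q, K, and V all come from the same sequence.
self_out, self_w = attention(text_tokens, text_tokens, text_tokens)

# Cross-attention: queries come from the text sequence,
# keys and values come from the numerical sequence.
cross_out, cross_w = attention(text_tokens, numeric_tokens, numeric_tokens)

print(len(self_w[0]), len(cross_w[0]))
```

Each row of `self_w` spreads a text token's attention over the text tokens, while each row of `cross_w` spreads it over the numerical tokens, which is how cross-attention links information across modalities.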