Forecasting Credit Ratings: A Case Study where Traditional Methods Outperform Generative LLMs

Felix Drinkall; Janet B. Pierrehumbert; Stefan Zohren

TL;DR

This study benchmarks generative LLMs against a discriminative XGBoost baseline for forecasting changes in corporate credit ratings using multimodal data that include SEC MD&A text, fundamental financials, macroeconomics, and historical ratings. The task is formalized as predicting the next-quarter rating movement $\tilde{R}_t$ from inputs $T_{t-1..p}$, $R_{t-1..p}$, and $N_{t-1..p}$, with $p\in\{1,2,3,4\}$. The key finding is that, although LLMs are strong at encoding text, they underperform when numeric signals are integrated, whereas the XGBoost baseline with High-density Embedding Clusters (HEC) features and numeric data achieves the best performance; zero-shot text-only prompting by GPT-family models can rival text-based encodings but fails to outperform models that combine modalities. The results underscore the continued value of discriminative, interpretable, multimodal approaches for financial forecasting and motivate future work on ensembles and more effective fusion of long textual data with time-series signals.
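The task formalization above can be made concrete with a small sketch of how lagged inputs might be paired with a movement label. This is an illustration under stated assumptions, not the authors' pipeline: it reads $T$, $R$, and $N$ as per-quarter text, rating, and numeric inputs, reads the subscript $t-1..p$ as the $p$ quarters before $t$, and assumes a three-class label ("up", "down", "same") on an ordinal rating scale. The names `movement` and `build_examples` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def movement(prev: int, curr: int) -> str:
    """Label the change between two ordinal ratings (higher = better).

    The three-class encoding is an assumption made for illustration.
    """
    if curr > prev:
        return "up"
    if curr < prev:
        return "down"
    return "same"


def build_examples(
    text: Sequence[str],
    ratings: Sequence[int],
    numeric: Sequence[List[float]],
    p: int,
) -> List[Tuple[Dict[str, list], str]]:
    """Pair each quarter t with the p quarters before it (t-1, ..., t-p)."""
    if not (len(text) == len(ratings) == len(numeric)):
        raise ValueError("all input series must have the same length")
    examples = []
    for t in range(p, len(ratings)):
        lags = range(t - 1, t - p - 1, -1)  # most recent quarter first
        features = {
            "T": [text[i] for i in lags],     # lagged text inputs
            "R": [ratings[i] for i in lags],  # lagged ratings
            "N": [numeric[i] for i in lags],  # lagged numeric inputs
        }
        # Target: movement of the rating at t relative to quarter t-1.
        examples.append((features, movement(ratings[t - 1], ratings[t])))
    return examples


if __name__ == "__main__":
    # Toy series for one hypothetical company, five quarters long.
    toy_text = ["q0", "q1", "q2", "q3", "q4"]
    toy_ratings = [5, 5, 4, 6, 6]
    toy_numeric = [[0.1], [0.2], [0.3], [0.4], [0.5]]
    for feats, label in build_examples(toy_text, toy_ratings, toy_numeric, p=2):
        print(feats["R"], label)
```

Only information from quarters before $t$ enters the feature dictionary, so the label at $t$ is never leaked into its own inputs. Varying `p` from 1 to 4 changes only the length of each lag window.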

Abstract

Large Language Models (LLMs) have been shown to perform well for many downstream tasks. Transfer learning can enable LLMs to acquire skills that were not targeted during pre-training. In financial contexts, LLMs can sometimes beat well-established benchmarks. This paper investigates how well LLMs perform in the task of forecasting corporate credit ratings. We show that while LLMs are very good at encoding textual information, traditional methods are still very competitive when it comes to encoding numeric and multimodal data. For our task, current LLMs perform worse than a more traditional XGBoost architecture that combines fundamental and macroeconomic data with high-density text-based embedding features.

Paper Structure (29 sections, 2 equations, 2 figures, 7 tables)

Introduction
Related Work
Text-based forecasting
Encoding Text for Forecasting
Generative Multimodal Forecasting
Credit Rating Prediction
Dataset
Credit ratings (C)
SEC filings
Fundamental data (F)
Macroeconomic data (M)
Dataset Construction
Methodology
Task Description
Boosting-Tree Baseline
...and 14 more sections

Figures (2)

Figure 1: Example of the best-performing feature, high-density clustering (Drinkall et al., 2022). Each dot represents a sentence, and the colored areas represent high-density regions of the embedding space.
Figure 2: Partial Dependence Plots (PDP) of text-based features against different target classes.
