KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

Neel Rajani; Lilli Kiessling; Aleksandr Ogaltsov; Claus Lang

KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

Neel Rajani, Lilli Kiessling, Aleksandr Ogaltsov, Claus Lang

TL;DR

The paper tackles the challenge of delivering reliable financial NLP by developing KodeXv0.1, a family of Llama 3.1–based models fine-tuned with a large, synthetic, multilingual financial corpus. It demonstrates that 8B and 70B variants, trained with 4-bit quantization and LoRA in a post-training regime, achieve state-of-the-art performance across multiple financial benchmarks, including outperforming GPT-4 on all tested metrics. The approach leverages a retrieval-augmented strategy and a rigorous LLM-as-judge evaluation to ensure reliability, while highlighting the practicality of deploying smaller open-source models in production pipelines. Overall, the work suggests that domain-specific synthetic data and careful instruction tuning can surpass proprietary models, with significant implications for cost-efficient, private, and scalable financial AI systems.

Abstract

Although powerful, current cutting-edge LLMs may not fulfil the needs of highly specialised sectors. We introduce KodeXv0.1, a family of large language models that outclass GPT-4 in financial question answering. We utilise the base variants of Llama 3.1 8B and 70B and adapt them to the financial domain through a custom training regime. To this end, we collect and process a large number of publicly available financial documents such as earnings calls and business reports. These are used to generate a high-quality, synthetic dataset consisting of Context-Question-Answer triplets which closely mirror real-world financial tasks. Using the train split of this dataset, we perform RAG-aware 4bit LoRA instruction tuning runs of Llama 3.1 base variants to produce KodeX-8Bv0.1 and KodeX-70Bv0.1. We then complete extensive model evaluations using FinanceBench, FinQABench and the withheld test split of our dataset. Our results show that KodeX-8Bv0.1 is more reliable in financial contexts than cutting-edge instruct models in the same parameter regime, surpassing them by up to 9.24%. In addition, it is even capable of outperforming state-of-the-art proprietary models such as GPT-4 by up to 7.07%. KodeX-70Bv0.1 represents a further improvement upon this, exceeding GPT-4's performance on every tested benchmark.

KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

TL;DR

Abstract

Paper Structure (17 sections, 8 figures, 2 tables)

This paper contains 17 sections, 8 figures, 2 tables.

Introduction
Related Works
Financial LLMs and their evaluation
Instruction following
Methods and Experimental Setup
Training data generation
Benchmarking
Training details
Results and Evaluation
Withheld test set
FinQABench
FinanceBench
Conclusion
Future research directions
Limitations
...and 2 more sections

Figures (8)

Figure 1: Performance comparison between KodeXv0.1 models against open-source instruct models and GPT-4. The 8B variant exhibits best-in-class performance, and the 70B variant outmatches GPT-4 on every benchmark.
Figure 2: Our instruction-tuning setup. A benchmark instance is parsed, inserted into the Llama 3.1 instruction-tuning format, and tokenized. The labels are created by shifting the inputs to the left by one, and masking tokens from the instruction with -100. This signals to PyTorch that these labels should be ignored when computing the loss.
Figure 3: Distribution of categories from which the documents in the training data were chosen.
Figure 4: Example of one training sample.
Figure 5: Comparison of the frequency with which KodeX-70Bv0.1 on GPT-4 obtains a certain mark out of 10 on the withheld test set.
...and 3 more figures

KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

TL;DR

Abstract

KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

Authors

TL;DR

Abstract

Table of Contents

Figures (8)