Table of Contents
Fetching ...

KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

Neel Rajani, Lilli Kiessling, Aleksandr Ogaltsov, Claus Lang

TL;DR

The paper tackles the challenge of delivering reliable financial NLP by developing KodeXv0.1, a family of Llama 3.1–based models fine-tuned with a large, synthetic, multilingual financial corpus. It demonstrates that 8B and 70B variants, trained with 4-bit quantization and LoRA in a post-training regime, achieve state-of-the-art performance across multiple financial benchmarks, including outperforming GPT-4 on all tested metrics. The approach leverages a retrieval-augmented strategy and a rigorous LLM-as-judge evaluation to ensure reliability, while highlighting the practicality of deploying smaller open-source models in production pipelines. Overall, the work suggests that domain-specific synthetic data and careful instruction tuning can surpass proprietary models, with significant implications for cost-efficient, private, and scalable financial AI systems.

Abstract

Although powerful, current cutting-edge LLMs may not fulfil the needs of highly specialised sectors. We introduce KodeXv0.1, a family of large language models that outclass GPT-4 in financial question answering. We utilise the base variants of Llama 3.1 8B and 70B and adapt them to the financial domain through a custom training regime. To this end, we collect and process a large number of publicly available financial documents such as earnings calls and business reports. These are used to generate a high-quality, synthetic dataset consisting of Context-Question-Answer triplets which closely mirror real-world financial tasks. Using the train split of this dataset, we perform RAG-aware 4bit LoRA instruction tuning runs of Llama 3.1 base variants to produce KodeX-8Bv0.1 and KodeX-70Bv0.1. We then complete extensive model evaluations using FinanceBench, FinQABench and the withheld test split of our dataset. Our results show that KodeX-8Bv0.1 is more reliable in financial contexts than cutting-edge instruct models in the same parameter regime, surpassing them by up to 9.24%. In addition, it is even capable of outperforming state-of-the-art proprietary models such as GPT-4 by up to 7.07%. KodeX-70Bv0.1 represents a further improvement upon this, exceeding GPT-4's performance on every tested benchmark.

KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

TL;DR

The paper tackles the challenge of delivering reliable financial NLP by developing KodeXv0.1, a family of Llama 3.1–based models fine-tuned with a large, synthetic, multilingual financial corpus. It demonstrates that 8B and 70B variants, trained with 4-bit quantization and LoRA in a post-training regime, achieve state-of-the-art performance across multiple financial benchmarks, including outperforming GPT-4 on all tested metrics. The approach leverages a retrieval-augmented strategy and a rigorous LLM-as-judge evaluation to ensure reliability, while highlighting the practicality of deploying smaller open-source models in production pipelines. Overall, the work suggests that domain-specific synthetic data and careful instruction tuning can surpass proprietary models, with significant implications for cost-efficient, private, and scalable financial AI systems.

Abstract

Although powerful, current cutting-edge LLMs may not fulfil the needs of highly specialised sectors. We introduce KodeXv0.1, a family of large language models that outclass GPT-4 in financial question answering. We utilise the base variants of Llama 3.1 8B and 70B and adapt them to the financial domain through a custom training regime. To this end, we collect and process a large number of publicly available financial documents such as earnings calls and business reports. These are used to generate a high-quality, synthetic dataset consisting of Context-Question-Answer triplets which closely mirror real-world financial tasks. Using the train split of this dataset, we perform RAG-aware 4bit LoRA instruction tuning runs of Llama 3.1 base variants to produce KodeX-8Bv0.1 and KodeX-70Bv0.1. We then complete extensive model evaluations using FinanceBench, FinQABench and the withheld test split of our dataset. Our results show that KodeX-8Bv0.1 is more reliable in financial contexts than cutting-edge instruct models in the same parameter regime, surpassing them by up to 9.24%. In addition, it is even capable of outperforming state-of-the-art proprietary models such as GPT-4 by up to 7.07%. KodeX-70Bv0.1 represents a further improvement upon this, exceeding GPT-4's performance on every tested benchmark.
Paper Structure (17 sections, 8 figures, 2 tables)

This paper contains 17 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Performance comparison between KodeXv0.1 models against open-source instruct models and GPT-4. The 8B variant exhibits best-in-class performance, and the 70B variant outmatches GPT-4 on every benchmark.
  • Figure 2: Our instruction-tuning setup. A benchmark instance is parsed, inserted into the Llama 3.1 instruction-tuning format, and tokenized. The labels are created by shifting the inputs to the left by one, and masking tokens from the instruction with -100. This signals to PyTorch that these labels should be ignored when computing the loss.
  • Figure 3: Distribution of categories from which the documents in the training data were chosen.
  • Figure 4: Example of one training sample.
  • Figure 5: Comparison of the frequency with which KodeX-70Bv0.1 on GPT-4 obtains a certain mark out of 10 on the withheld test set.
  • ...and 3 more figures