Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance -- A Case Study in Finance

Meni Brief; Oded Ovadia; Gil Shenderovitz; Noga Ben Yoash; Rachel Lemberg; Eitam Sheetrit

TL;DR

The paper investigates multi-task fine-tuning for finance-domain LLMs and demonstrates a "cocktail effect": fine-tuning on a combination of related financial tasks yields better task performance than fine-tuning on a single task. Through a large-scale ablation study spanning roughly 220 training runs across four models and nine datasets, the authors show that multi-task training can push the 3.8B-parameter Phi-3-Mini model past GPT-4-o on several benchmarks and to state-of-the-art results on some finance tasks. They also explore adding general instruction data as a form of regularization and adding mathematical data to boost numerical reasoning. Performance on numerical tasks improves, but the gains transfer only to a limited extent to broad domain knowledge or complex reasoning. These results underscore the value of cross-task learning for task-specific finance applications and point to hybrid strategies that balance task proficiency with domain understanding.

Abstract

The application of large language models (LLMs) in domain-specific contexts, including finance, has expanded rapidly. Domain-specific LLMs are typically evaluated based on their performance in various downstream tasks relevant to the domain. In this work, we present a detailed analysis of fine-tuning LLMs for such tasks. Somewhat counterintuitively, we find that in domain-specific cases, fine-tuning exclusively on the target task is not always the most effective strategy. Instead, multi-task fine-tuning, where models are trained on a cocktail of related tasks, can significantly enhance performance. We demonstrate how this approach enables a small model, such as Phi-3-Mini, to achieve state-of-the-art results, even surpassing the much larger GPT-4-o model on financial benchmarks. Our study involves a large-scale experiment, conducting over 200 training experiments using several widely adopted LLMs as baselines, and empirically confirms the benefits of multi-task fine-tuning. Additionally, we explore the use of general instruction data as a form of regularization, suggesting that it helps minimize performance degradation. We also investigate the inclusion of mathematical data, finding improvements in numerical reasoning that transfer effectively to financial tasks. Finally, we note that while fine-tuning for downstream tasks leads to targeted improvements in task performance, it does not necessarily result in broader gains in domain knowledge or complex domain reasoning abilities.

Paper Structure (20 sections, 5 equations, 5 figures, 7 tables)

Introduction
Multi-task Fine-Tuning
Background
Problem Formulation
Methodology
Datasets
Core Financial Datasets
General Training Datasets
Additional Evaluation Datasets
Evaluation and Results
Experiment Setup
Metrics
Main Results
Related Work
Conclusions
...and 5 more sections

Figures (5)

Figure 1: A comparison of performance across financial tasks between GPT-4-o, the baseline Phi-3-Mini model, and the best results achieved by multi-task fine-tuning of Phi-3-Mini.
Figure 2: Overview of the methodology. The steps are: $\binom{n}{0} \rightarrow \binom{n}{1} \rightarrow \binom{n}{2} \rightarrow \binom{n}{n-1} \rightarrow \binom{n}{n}$.
Figure 3: A visualization of the results table. The experiment results for single-task and multi-task fine-tuning, aggregated across all experiments.
Figure 4: Normalized averaged scores for all seven core tasks described in the Core Financial Datasets section, across all experiments. Each point represents the average score for a single fine-tuned model. The colors represent the type of datasets used in the experiment.
Figure 5: Evaluation scores of all four models on all seven core tasks described in the Core Financial Datasets section. The relative gain (in percent) of each fine-tuning experiment is reported.
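
The subset sizes named in the Figure 2 caption (0, 1, 2, n-1, and n) can be made concrete with a small enumeration. The sketch below is illustrative only: it assumes each stage means "every combination of that many core tasks", uses n = 7 because the captions mention seven core tasks, and uses placeholder task names (`task_1` ... `task_7`) and a hypothetical helper `mixture_stages` that do not come from the paper.

```python
from itertools import combinations


def mixture_stages(tasks):
    """Group task subsets by the sizes listed in the Figure 2 caption.

    Assumption: each stage C(n, k) stands for all k-sized combinations
    of the n tasks, for k in {0, 1, 2, n-1, n}.
    """
    n = len(tasks)
    # Deduplicate and bound the sizes so small n does not repeat a stage.
    sizes = sorted({0, 1, 2, n - 1, n} & set(range(n + 1)))
    return {k: list(combinations(tasks, k)) for k in sizes}


# Placeholder names; the real dataset names are not listed in this excerpt.
core_tasks = [f"task_{i}" for i in range(1, 8)]
stages = mixture_stages(core_tasks)

for k, subsets in stages.items():
    print(f"subset size {k}: {len(subsets)} combinations")

print("total per model:", sum(len(s) for s in stages.values()))
```

Under these assumptions the enumeration gives 1 + 7 + 21 + 7 + 1 = 37 task mixtures per model. The experiments that add general instruction or mathematical data are not modeled here, so this count is not expected to match the roughly 220 runs reported.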
