Fine-tuning of lightweight large language models for sentiment classification on heterogeneous financial textual data

Alvaro Paredes Amorin, Andre Python, Christoph Weisser

TL;DR

The paper evaluates lightweight open-source LLMs against FinBERT for financial sentiment analysis across English and Chinese sources, using a domain-balanced, PEFT-based fine-tuning pipeline with 4-bit quantization. It finds Qwen3-8B and Llama3-8B Instruct often outperform FinBERT, even with minimal training data, highlighting strong zero- and few-shot capabilities. A domain-balanced training approach improves cross-domain performance and reduces interference, suggesting lightweight LLMs as cost-effective options for heterogeneous financial text. The study also outlines practical guidance for low-resource settings and notes avenues for future multilingual and RAG-enabled extensions.

Abstract

Large language models (LLMs) play an increasingly important role in financial markets analysis by capturing signals from complex and heterogeneous textual data sources, such as tweets, news articles, reports, and microblogs. However, their performance is dependent on large computational resources and proprietary datasets, which are costly, restricted, and therefore inaccessible to many researchers and practitioners. To reflect realistic situations, we investigate the ability of lightweight open-source LLMs -- smaller and publicly available models designed to operate with limited computational resources -- to generalize sentiment understanding from financial datasets of varying sizes, sources, formats, and languages. We compare the benchmark finance natural language processing (NLP) model, FinBERT, and three open-source lightweight LLMs, DeepSeek-LLM 7B, Llama3 8B Instruct, and Qwen3 8B, on five publicly available datasets: FinancialPhraseBank, Financial Question Answering, Gold News Sentiment, Twitter Sentiment and Chinese Finance Sentiment. We find that LLMs, especially Qwen3 8B and Llama3 8B, perform best in most scenarios, even when using only 5% of the available training data. These results hold in zero-shot and few-shot learning scenarios. Our findings indicate that lightweight, open-source LLMs constitute a cost-effective option, as they can achieve competitive performance on heterogeneous textual data even when trained on only a limited subset of the extensive annotated corpora that are typically deemed necessary.
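
To make the zero-shot and few-shot scenarios mentioned in the abstract concrete, the following is a minimal sketch of how a k-shot sentiment prompt might be assembled. The instruction wording, the three-label set, the `build_prompt` helper, and the example sentences are all illustrative assumptions; the prompt templates actually used in the paper are not given here.

```python
from typing import Sequence, Tuple

# Assumed label set for illustration; the datasets may use different labels.
LABELS = ("negative", "neutral", "positive")


def build_prompt(text: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a k-shot prompt, where k = len(examples).

    With no examples the result is a zero-shot prompt; with three
    (text, label) pairs it is a 3-shot prompt, and so on.
    """
    blocks = [
        "Classify the sentiment of the financial text as one of: "
        + ", ".join(LABELS) + "."
    ]
    # Each labeled example becomes one demonstration block.
    for ex_text, ex_label in examples:
        blocks.append(f"Text: {ex_text}\nSentiment: {ex_label}")
    # The query is left unlabeled so the model completes the label.
    blocks.append(f"Text: {text}\nSentiment:")
    return "\n\n".join(blocks)


# Made-up demonstration sentences, not taken from any of the datasets.
shots = [
    ("Quarterly profit rose sharply.", "positive"),
    ("The company issued a profit warning.", "negative"),
    ("The annual meeting is scheduled for May.", "neutral"),
]
zero_shot_prompt = build_prompt("Gold prices slipped on Monday.")
three_shot_prompt = build_prompt("Gold prices slipped on Monday.", shots)
```

The only difference between the two prompts is the number of demonstration blocks placed before the query, which is the distinction the few-shot setting relies on.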

Paper Structure

This paper contains 12 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Differences between zero-shot, few-shot and fine-tuning. A zero-shot prompt provides no examples, while a few-shot prompt does, e.g. 3-shot means using prompts containing 3 examples. Fine-tuning consists of training a pre-trained (base) model on task-specific data.
  • Figure 2: Class distribution per dataset. The datasets are FinancialPhraseBank (FPB), Financial Question Answering (FiQA), Gold News Sentiment (GSD), Twitter Sentiment (TSD) and Chinese Sentiment (CSD).
  • Figure 3: Fine-tuned and 0-, 3- and 5-shot learning results for all models. The x-axis shows the number of shots or the proportion of training data used for fine-tuning. The y-axis shows the macro F1 score. Results for the DeepSeek, Llama, Qwen and FinBERT models are shown in orange, green, red and blue, respectively.
  • Figure 4: Comparison of training strategies applied to predict sentiment. Predictive performance of DeepSeek-LLM 7B using sequential (dashed lines) and balanced (solid lines) fine-tuning methods. The x-axis represents the proportion of data used for training (from 5% to 100%) and the y-axis shows the macro F1 score.
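
Figures 3 and 4 report the macro F1 score. As a reminder of what that metric measures, here is a minimal pure-Python sketch that computes the per-class F1 and takes their unweighted mean, so that minority classes count as much as majority ones. The `macro_f1` function name and the toy labels are illustrative assumptions, and this is not the evaluation code used in the paper.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one sample is required")

    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class has no hits.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
y_true = ["positive", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral"]
score = macro_f1(y_true, y_pred)  # per-class F1: 2/3, 2/3, 1 -> mean 7/9
```

Because every class contributes equally to the average, macro F1 is a natural choice when class distributions are imbalanced, as the per-dataset class distributions in Figure 2 suggest they may be.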