Bayesian Network Fusion of Large Language Models for Sentiment Analysis

Rasoul Amirzadeh, Dhananjay Thiruvady, Fatemeh Shiri

TL;DR

The paper addresses the challenges of transparency, fine-tuning cost, and cross-domain variability in financial sentiment analysis by introducing Bayesian Network LLM Fusion (BNLF), a late-fusion framework that probabilistically combines predictions from FinBERT, RoBERTa, and BERTweet. By structuring the fusion as a Bayesian network, BNLF provides interpretable uncertainty estimates and causal-like reasoning about how each model and dataset context contribute to the final sentiment decision. Empirically, BNLF yields about a six-percentage-point improvement in accuracy over strong baselines across three diverse financial corpora, while maintaining balanced performance across sentiment classes. The framework is lightweight and inference-focused, enabling robust deployment without extensive fine-tuning, and offers a modular path toward scalable, interpretable AI in finance through probabilistic reasoning and model-agnostic fusion.

Abstract

Large language models (LLMs) continue to advance, with an increasing number of domain-specific variants tailored for specialised tasks. However, these models often lack transparency and explainability, can be costly to fine-tune, require substantial prompt engineering, yield inconsistent results across domains, and have a significant adverse environmental impact due to their high computational demands. To address these challenges, we propose the Bayesian network LLM fusion (BNLF) framework, which integrates predictions from three LLMs, namely FinBERT, RoBERTa, and BERTweet, through a probabilistic mechanism for sentiment analysis. BNLF performs late fusion by modelling the sentiment predictions from multiple LLMs as probabilistic nodes within a Bayesian network. Evaluated across three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF demonstrates consistent gains of about six percent in accuracy over the baseline LLMs, underscoring its robustness to dataset variability and the effectiveness of probabilistic fusion for interpretable sentiment classification.
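To make the late-fusion idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a naive-Bayes-shaped network in which a hidden true sentiment is the parent of the three model predictions; the structure actually used in BNLF may differ. The prior, the per-model reliabilities, and the helper names `make_cpt` and `fuse` are all hypothetical and chosen only for illustration.

```python
LABELS = ("NEG", "NEU", "POS")

# Hypothetical prior over the true sentiment (illustrative values, not from the paper).
PRIOR = {"NEG": 0.2, "NEU": 0.5, "POS": 0.3}


def make_cpt(hit_rate):
    """Build P(prediction | true label) for one classifier.

    The classifier is assumed to be right with probability `hit_rate`;
    the remaining mass is split evenly over the other labels.
    """
    miss = (1.0 - hit_rate) / (len(LABELS) - 1)
    return {
        true: {pred: (hit_rate if pred == true else miss) for pred in LABELS}
        for true in LABELS
    }


# Hypothetical reliabilities; in practice these tables would be learned from data.
CPTS = {
    "FinBERT": make_cpt(0.80),
    "RoBERTa": make_cpt(0.70),
    "BERTweet": make_cpt(0.65),
}


def fuse(predictions):
    """Return P(true label | model predictions) by exact enumeration."""
    scores = {}
    for true in LABELS:
        score = PRIOR[true]
        for model, pred in predictions.items():
            score *= CPTS[model][true][pred]
        scores[true] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# Two models say neutral, one says positive.
posterior = fuse({"FinBERT": "NEU", "RoBERTa": "POS", "BERTweet": "NEU"})
fused_label = max(posterior, key=posterior.get)
print(posterior, fused_label)
```

Because the output is a full posterior rather than a single vote, the fused result carries an uncertainty estimate, which is the property the abstract highlights over plain ensembling.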

Paper Structure

This paper contains 14 sections, 1 equation, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Each input text, drawn from multiple financial and social media corpora, is processed by three LLM-based classifiers: FinBERT, RoBERTa, and BERTweet. These models generate individual sentiment predictions that are fused within a BN through probabilistic inference. The BN outputs a posterior sentiment distribution, which is then mapped to a discrete sentiment label: negative (NEG), neutral (NEU), or positive (POS).
  • Figure 2: A real example from the dataset showing how BNLF fuses sentiment predictions from FinBERT, RoBERTa, and BERTweet. The BN integrates these individual predictions to generate final sentiment probabilities of (POS = 0.3436, NEU = 0.6513, NEG = 0.0051), and the neutral class is selected because it has the highest probability.
  • Figure 3: Overall performance comparison of BNLF, individual LLMs, ensemble baselines, and the external DistilRoBERTa model across accuracy, macro-F1, and weighted-F1 metrics. The blue bars representing BNLF consistently exceed those of all baselines, including the ensemble methods (majority voting and averaging), demonstrating the effectiveness of its probabilistic fusion approach.
  • Figure 4: Accuracy comparison across datasets, where each group of bars corresponds to one dataset (Financial PhraseBank, FIQA, TFNS), with bars representing the individual LLMs and BNLF. It shows that BNLF achieves the highest accuracy on FIQA and TFNS, while DistilRoBERTa reaches the highest accuracy on Financial PhraseBank.
  • Figure 5: Heatmap of pairwise agreement scores between individual LLMs and BNLF. Darker shades indicate stronger agreement, corresponding to higher proportions of matching sentiment labels.
  • ...and 3 more figures
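Two operations mentioned in the captions above can be sketched briefly: mapping a posterior to a discrete label (Figure 2) and computing a pairwise agreement score as the proportion of matching labels (Figure 5). The posterior values are the ones quoted in the Figure 2 caption; the function names and the two short label lists are hypothetical and serve only as illustration.

```python
def to_label(posterior):
    """Pick the sentiment class with the highest posterior probability."""
    return max(posterior, key=posterior.get)


def agreement(labels_a, labels_b):
    """Proportion of positions where two label sequences match."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Posterior quoted in the Figure 2 caption.
figure2_posterior = {"POS": 0.3436, "NEU": 0.6513, "NEG": 0.0051}
label = to_label(figure2_posterior)

# Hypothetical predictions for four texts (not taken from the paper).
finbert_labels = ["NEU", "POS", "NEG", "NEU"]
bnlf_labels = ["NEU", "POS", "NEU", "NEU"]
score = agreement(finbert_labels, bnlf_labels)

print(label, score)
```

Computing `agreement` for every pair of models would give the matrix that a heatmap such as Figure 5 visualises.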