Bayesian Network Fusion of Large Language Models for Sentiment Analysis
Rasoul Amirzadeh, Dhananjay Thiruvady, Fatemeh Shiri
TL;DR
The paper addresses the challenges of transparency, fine-tuning cost, and cross-domain variability in financial sentiment analysis by introducing Bayesian Network LLM Fusion (BNLF), a late-fusion framework that probabilistically combines predictions from FinBERT, RoBERTa, and BERTweet. By structuring the fusion as a Bayesian network, BNLF provides interpretable uncertainty estimates and causal-like reasoning about how each model and dataset context contribute to the final sentiment decision. Empirically, BNLF yields about a six-percentage-point improvement in accuracy over strong baselines across three diverse financial corpora, while maintaining balanced performance across sentiment classes. The framework is lightweight and inference-focused, enabling robust deployment without extensive fine-tuning, and offers a modular path toward scalable, interpretable AI in finance through probabilistic reasoning and model-agnostic fusion.
Abstract
Large language models (LLMs) continue to advance, with an increasing number of domain-specific variants tailored for specialised tasks. However, these models often lack transparency and explainability, can be costly to fine-tune, require substantial prompt engineering, yield inconsistent results across domains, and have a significant adverse environmental impact due to their high computational demands. To address these challenges, we propose the Bayesian network LLM fusion (BNLF) framework, which integrates predictions from three LLMs (FinBERT, RoBERTa, and BERTweet) through a probabilistic mechanism for sentiment analysis. BNLF performs late fusion by modelling the sentiment predictions from the individual LLMs as probabilistic nodes within a Bayesian network. Evaluated across three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF demonstrates consistent gains of about six percent in accuracy over the baseline LLMs, underscoring its robustness to dataset variability and the effectiveness of probabilistic fusion for interpretable sentiment classification.
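To make the idea of late fusion through a Bayesian network concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplest possible structure, in which a latent sentiment node is the parent of one prediction node per model (a naive-Bayes-style network); the abstract does not specify BNLF's actual structure, so this choice is an assumption. The function names `fit` and `posterior`, the smoothing parameter `alpha`, and the six toy rows are all hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict

LABELS = ("negative", "neutral", "positive")
MODELS = ("finbert", "roberta", "bertweet")


def fit(rows, alpha=1.0):
    """Estimate the prior P(S) and the tables P(prediction_m | S).

    rows: list of (true_label, {model_name: predicted_label}).
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    prior_counts = Counter(true for true, _ in rows)
    cpt_counts = {m: defaultdict(Counter) for m in MODELS}
    for true, preds in rows:
        for m in MODELS:
            cpt_counts[m][true][preds[m]] += 1

    k = len(LABELS)
    n = len(rows)
    prior = {s: (prior_counts[s] + alpha) / (n + alpha * k) for s in LABELS}
    cpt = {
        m: {
            s: {
                p: (cpt_counts[m][s][p] + alpha) / (prior_counts[s] + alpha * k)
                for p in LABELS
            }
            for s in LABELS
        }
        for m in MODELS
    }
    return prior, cpt


def posterior(prior, cpt, preds):
    """Return P(S | model predictions) by Bayes' rule over the network."""
    scores = {}
    for s in LABELS:
        score = prior[s]
        for m in MODELS:
            score *= cpt[m][s][preds[m]]
        scores[s] = score
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}


# Hypothetical labelled examples: (gold label, each model's hard prediction).
toy_rows = [
    ("positive", {"finbert": "positive", "roberta": "positive", "bertweet": "neutral"}),
    ("positive", {"finbert": "positive", "roberta": "positive", "bertweet": "positive"}),
    ("negative", {"finbert": "negative", "roberta": "neutral", "bertweet": "negative"}),
    ("negative", {"finbert": "negative", "roberta": "negative", "bertweet": "negative"}),
    ("neutral", {"finbert": "neutral", "roberta": "neutral", "bertweet": "positive"}),
    ("neutral", {"finbert": "neutral", "roberta": "neutral", "bertweet": "neutral"}),
]

prior, cpt = fit(toy_rows)

# The models disagree here; the network weighs each vote by its learned reliability.
fused = posterior(
    prior, cpt, {"finbert": "negative", "roberta": "neutral", "bertweet": "negative"}
)
print(max(fused, key=fused.get), fused)
```

Because the fused output is a full posterior distribution rather than a single label, it exposes how confident the combination is, which is the kind of uncertainty estimate a probabilistic fusion layer provides. A richer structure, for example one with an additional node for the source corpus, would follow the same pattern with larger conditional probability tables.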
