FinTextSim: Enhancing Financial Text Analysis with BERTopic

Simon Jehnen, Joaquín Ordieres-Meré, Javier Villalba-Díez

TL;DR

This work introduces FinTextSim, a finetuned sentence-transformer tailored for financial text, and assesses its impact on BERTopic for extracting topics from Item 7 and Item 7A of 10-K filings (2016–2022). FinTextSim markedly improves intratopic cohesion and reduces intertopic overlap compared to general-purpose embeddings, enabling BERTopic to produce clear, economically meaningful topic clusters in financial discourse. The study demonstrates that domain-specific embeddings are essential for reliable financial text analysis and highlights potential benefits for business valuation and stock price prediction models. It also discusses the limitations of standard coherence metrics and proposes domain-weighted evaluation to better capture financial topic quality and organizing power.

Abstract

Recent advancements in information availability and computational capabilities have transformed the analysis of annual reports, integrating traditional financial metrics with insights from textual data. To extract valuable insights from this wealth of textual data, automated review processes, such as topic modeling, are crucial. This study examines the effectiveness of BERTopic, a state-of-the-art topic model relying on contextual embeddings, for analyzing Item 7 and Item 7A of 10-K filings from S&P 500 companies (2016–2022). Moreover, we introduce FinTextSim, a finetuned sentence-transformer model optimized for clustering and semantic search in financial contexts. Compared to all-MiniLM-L6-v2, the most widely used sentence-transformer, FinTextSim increases intratopic similarity by 81% and reduces intertopic similarity by 100%, significantly enhancing organizational clarity. We assess BERTopic's performance using embeddings from both FinTextSim and all-MiniLM-L6-v2. Our findings reveal that BERTopic forms clear and distinct economic topic clusters only when paired with FinTextSim's embeddings. Without FinTextSim, BERTopic struggles with misclassification and overlapping topics. Thus, FinTextSim is pivotal for advancing financial text analysis. FinTextSim's enhanced contextual embeddings, tailored for the financial domain, elevate the quality of future research and financial information. This improved quality of financial information will enable stakeholders to gain a competitive advantage, streamlining resource allocation and decision-making processes. Moreover, the improved insights have the potential to improve business valuation and stock price prediction models.
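
To make the two headline metrics concrete, the following minimal sketch computes an intratopic and an intertopic similarity score on toy data. It assumes that intratopic similarity is the mean cosine similarity over pairs of embeddings sharing a topic label and that intertopic similarity is the mean over pairs with different labels; the abstract does not spell out the exact definitions, so this is an illustration and not the authors' procedure. The function names, the two-dimensional toy vectors, and the labels "cost" and "hr" are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings, labels, same_topic):
    """Mean cosine similarity over all pairs whose labels match (same_topic=True)
    or differ (same_topic=False). Assumed definition, see the lead-in."""
    sims = [
        cosine(embeddings[i], embeddings[j])
        for i, j in combinations(range(len(labels)), 2)
        if (labels[i] == labels[j]) == same_topic
    ]
    return sum(sims) / len(sims)


def intratopic_similarity(embeddings, labels):
    return mean_pairwise_similarity(embeddings, labels, same_topic=True)


def intertopic_similarity(embeddings, labels):
    return mean_pairwise_similarity(embeddings, labels, same_topic=False)


# Hypothetical toy embeddings: two sentences per topic, not real model output.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["cost", "cost", "hr", "hr"]

intra = intratopic_similarity(toy_embeddings, toy_labels)
inter = intertopic_similarity(toy_embeddings, toy_labels)
print(f"intratopic: {intra:.3f}, intertopic: {inter:.3f}")
```

Under these assumed definitions, an embedding model that organizes topics well yields a high intratopic score and an intertopic score close to zero, which is the direction of improvement the abstract reports for FinTextSim.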

Paper Structure

This paper contains 30 sections, 15 figures, 9 tables.

Figures (15)

  • Figure 1: FinTextSim vs. AM (all-MiniLM-L6-v2) on the test dataset. The color of each datapoint represents a topic from the keyword list.
  • Figure 2: Topic representations - FinTextSim vs. AM - HR. Original cleaned sentence: 'a majority of employees belong to labor unions'.
  • Figure 3: Topic representations - FinTextSim vs. AM - Cost. Original cleaned sentence: 'business is vulnerable to fluctuations in fuel costs and disruptions in fuel supplies'.
  • Figure 4: Topic representations - FinTextSim vs. AM - Operations. Original cleaned sentence: 'the company manufactures markets and distributes spices seasoning mixes condiments and other flavorful products to the entire food industry retailers food manufacturers and foodservice businesses'.
  • Figure 5: Topic representations - FinTextSim vs. AM - Accounting. Original cleaned sentence: 'critical accounting policies estimates and judgments our consolidated financial statements are based on gaap which requires us to make estimates and assumptions about future events that affect the amounts reported in our consolidated financial statements'.
  • ...and 10 more figures