
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally

Agam Shah, Siddhant Sukhani, Huzaifa Pardawala, Saketh Budideti, Riya Bhadani, Rudra Gopal, Siddhartha Somani, Rutwik Routu, Michael Galarnyk, Soungmin Lee, Arnav Hiray, Akshar Ravichandran, Eric Kim, Pranav Aluru, Joshua Zhang, Sebastian Jaskowski, Veer Guda, Meghaj Tarte, Liqin Ye, Spencer Gosden, Rachel Yuh, Sloka Chava, Sahasra Chava, Dylan Patrick Kelly, Aiden Chiang, Harsit Mittal, Sudheer Chava

TL;DR

This work introduces the World Central Banks (WCB) dataset, the largest corpus of monetary policy communications to date, spanning 1996–2024 across 25 central banks and yielding over 380k sentences, with 1k annotated sentences per bank for three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation. It demonstrates that a General Setup model trained on aggregated data significantly outperforms bank-specific models, evidencing cross-bank transfer learning and a shared linguistic structure in central bank communications. The authors benchmark 7 pretrained language models and 9 large language models, perform extensive ablations (including few-shot and annotation-guided prompting), and release a rich set of artifacts (datasets, annotations, fine-tuned models, and benchmarks) under CC-BY-NC-SA 4.0 via HuggingFace and GitHub. Economic analysis links stance signals to inflation dynamics, while human evaluation and error analyses validate practical utility and highlight domain-specific challenges. The work emphasizes broad reproducibility, discusses global coverage gaps and ethical considerations, and showcases transferability to non-financial domains, underscoring the framework’s significance for policy analysis, forecasting, and governance research.

Abstract

Central banks around the world play a crucial role in maintaining economic stability. Deciphering the policy implications of their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions and spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolution, and secondary expert review. We define three tasks (Stance Detection, Temporal Classification, and Uncertainty Estimation) and annotate each sentence for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs), the latter under Zero-Shot, Few-Shot, and annotation-guided prompting, running 15,075 benchmarking experiments in total. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle "the whole is greater than the sum of its parts." Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.
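The abstract does not spell out how sentences are spread "uniformly ... across all available years". A minimal Python sketch, assuming a year-stratified reading (an equal quota per year, with unused quota from sparse years passed on to the others), might look like the following. The function name `sample_uniform_across_years` and the `(year, text)` input format are hypothetical, not taken from the authors' code.

```python
import random
from collections import defaultdict


def sample_uniform_across_years(sentences, n=1000, seed=0):
    """Draw up to n sentences for one bank, spreading picks evenly over years.

    Hypothetical helper: assumes 'uniform across years' means equal
    allocation per year. sentences is an iterable of (year, text) pairs.
    Returns a list of (year, text) pairs.
    """
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for year, text in sentences:
        by_year[year].append(text)
    years = sorted(by_year)
    for y in years:
        rng.shuffle(by_year[y])  # random order within each year

    picked = []
    depth = 0  # how many sentences have been taken from each year so far
    while len(picked) < n:
        progressed = False
        # One round-robin pass: take the next sentence from every year that still has one.
        for y in years:
            if len(picked) == n:
                break
            if depth < len(by_year[y]):
                picked.append((y, by_year[y][depth]))
                progressed = True
        if not progressed:  # every year is exhausted
            break
        depth += 1
    return picked
```

Running this once per bank with `n=1000` would yield a 25k-sentence sample like the one described above. If "uniformly" instead means uniform over sentences rather than years, a plain `random.Random(seed).sample(...)` over each bank's sentences would be the simpler alternative.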

Paper Structure

This paper contains 98 sections, 4 equations, 40 figures, 84 tables.

Figures (40)

  • Figure 1: A summary of the World Central Bank (WCB) dataset and experiments. We systematically collect, clean, and research the communications of 25 central banks from 1996 to 2024 at the sentence level, yielding 380,200 sentences (avg. 27.06 words/sentence) in our corpus. We present an annotated dataset of 25,000 sentences across three tasks (Stance Detection, Temporal Classification, and Uncertainty Estimation), created using comprehensive individual annotation guides and detailed annotation instructions. We benchmark seven PLMs and eight LLMs on these tasks under a bank-specific setup (1,000 annotated sentences per bank) and a global setup (25,000 annotated sentences). The figure shows the performance of the General (All-Banks) Setup model for each task. In the tables, $^*$ denotes an average.
  • Figure 2: The dataset generation process for each of the 25 central banks in the corpus consists of three stages. (1a) We collected data from central bank communications and converted any PDFs to Markdown files. (2a) The data was cleaned using regex patterns and tokenized as described in the Dataset Construction and Cleaning appendix. (3a) The data for each central bank was consolidated into JSON files as described in the Metadata appendix. (1b) Annotators were divided into 25 groups of four. (2b) Each group researched its assigned central bank and labeled a sample of 100 sentences. (3b) The annotation guides drafted individually by each group member were consolidated into a collaborative document for each bank. (4) The group members independently annotated the corresponding sentences using their annotation guide. (5) In pairs, they compared their annotations and resolved disagreements (see the Annotation Guides appendix). (6) An expert annotator performed a final review of all annotations from each group.
  • Figure 3: Impact of varying training sample sizes on the average test F1-Score of the best-performing model using the entire annotated dataset. Performance improves rapidly up to 600 samples and then begins to plateau. We first average the results over the 3 random seeds for each bank and then take the mean of those averages across all banks.
  • Figure 4: Template of JSON-style formatting for the metadata of each central bank in the corpus.
  • Figure 5: Tile 1 displays the interface to annotate sentences. Tile 2 displays the "Reviewer" view, allowing annotators to view disagreements after their initial annotations are complete. Tiles 3a, 3b, and 3c display the process of updating the annotation after pairs have reached a consensus.
  • ...and 35 more figures
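The two-step averaging described in the Figure 3 caption (first over the 3 random seeds per bank, then across banks) can be sketched in Python as follows. The function names and the `{train_size: {bank: [f1_per_seed, ...]}}` layout are hypothetical; the sketch only illustrates the aggregation order.

```python
from statistics import mean


def two_level_mean(scores_by_bank):
    """scores_by_bank: {bank: [test F1 for each random seed]} (hypothetical layout)."""
    # Step 1: average F1 over the random seeds of each bank.
    per_bank = [mean(seed_scores) for seed_scores in scores_by_bank.values()]
    # Step 2: unweighted mean of those per-bank averages.
    return mean(per_bank)


def learning_curve(results):
    """results: {train_size: {bank: [f1_seed0, f1_seed1, ...]}} -> {train_size: avg F1}."""
    return {size: two_level_mean(by_bank) for size, by_bank in sorted(results.items())}
```

Because the second step is an unweighted mean of per-bank averages, a bank with a single run counts as much as one with three, which a pooled mean over all runs would not guarantee.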
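Step (5) of the Figure 2 pipeline and the "Reviewer" view in Figure 5 revolve around spotting and resolving label mismatches between paired annotators. A minimal Python sketch of that bookkeeping follows. The data layout, the function names, and the label strings used in any example (e.g. "hawkish", "forward") are hypothetical illustrations, not the WCB annotation schema.

```python
TASKS = ("stance", "temporal", "uncertainty")  # the three WCB annotation tasks


def find_disagreements(ann_a, ann_b):
    """Return (sentence_id, task, label_a, label_b) for every mismatch between two annotators.

    ann_a / ann_b map sentence_id -> {task: label}; this layout is hypothetical.
    """
    out = []
    for sid in sorted(ann_a.keys() & ann_b.keys()):
        for task in TASKS:
            a, b = ann_a[sid].get(task), ann_b[sid].get(task)
            if a != b:
                out.append((sid, task, a, b))
    return out


def resolve(ann_a, ann_b, consensus):
    """Merge a pair's annotations; consensus maps (sentence_id, task) -> agreed label."""
    unresolved = [(sid, task) for sid, task, _, _ in find_disagreements(ann_a, ann_b)
                  if (sid, task) not in consensus]
    if unresolved:
        raise ValueError(f"unresolved disagreements: {unresolved}")
    # Where both annotators agree, annotator A's label equals B's, so copy A's.
    merged = {sid: dict(ann_a[sid]) for sid in ann_a.keys() & ann_b.keys()}
    for (sid, task), label in consensus.items():
        if sid in merged:
            merged[sid][task] = label  # overwrite with the pair's agreed label
    return merged
```

Raising on unresolved items mirrors the idea that every disagreement is settled by the pair before the expert's final review; an actual tool might instead queue those items for the reviewer.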