BeanCounter: A low-toxicity, large-scale, and open dataset of business-oriented text

Siyan Wang, Bradford Levy

TL;DR

This work introduces BeanCounter, a public dataset of more than 159B tokens extracted from businesses' disclosures, and suggests that it is a novel source of low-toxicity, high-quality domain-specific data with sufficient scale to train multi-billion parameter LLMs.

Abstract

Many of the recent breakthroughs in language modeling have resulted from scaling effectively the same model architecture to larger datasets. In this vein, recent work has highlighted performance gains from increasing training dataset size and quality, suggesting a need for novel sources of large-scale datasets. In this work, we introduce BeanCounter, a public dataset consisting of more than 159B tokens extracted from businesses' disclosures. We show that this data is indeed novel: less than 0.1% of BeanCounter appears in Common Crawl-based datasets, and it is an order of magnitude larger than datasets relying on similar sources. Given the data's provenance, we hypothesize that BeanCounter is comparatively more factual and less toxic than web-based datasets. Exploring this hypothesis, we find that many demographic identities occur with similar prevalence in BeanCounter but in significantly less toxic contexts relative to other datasets. To demonstrate the utility of BeanCounter, we evaluate two LLMs continually pre-trained on BeanCounter and compare them with their base models. We find an 18-33% reduction in toxic generation and improved performance within the finance domain for the continually pre-trained models. Collectively, our work suggests that BeanCounter is a novel source of low-toxicity and high-quality domain-specific data with sufficient scale to train multi-billion parameter LLMs.

Paper Structure (43 sections, 16 figures, 17 tables)

Figures (16)

  • Figure 1: Overview of dataset construction: all EDGAR filings are downloaded, text is extracted from the filings using content-type-specific extractors, and the extracted text is then cleaned and deduplicated.
  • Figure 2: Text volume by year and form type.
  • Figure 3: Top 10 industries with the highest token volume.
  • Figure 4: Firms that contribute the most to textual volume.
  • Figure 5: Changes in safety scores across 13 demographic target groups for Pythia-1.4B and Phi-1.5 after continued pre-training on BeanCounter. The safety score ranges from 0 to 1; a higher score indicates that the model is less likely to produce toxic generations relative to benign ones.
  • ...and 11 more figures
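
The construction steps named in the Figure 1 caption (extract by content type, clean, deduplicate) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the extractor functions, the whitespace-only cleaning rule, and exact-match SHA-256 deduplication are hypothetical stand-ins, not the authors' pipeline, and the download step is replaced by a small in-memory sample.

```python
import hashlib
import html
import re
from typing import Callable, Dict, Iterable, List, Tuple


def extract_plain(raw: str) -> str:
    # Hypothetical extractor: plain text is passed through unchanged.
    return raw


def extract_html(raw: str) -> str:
    # Hypothetical extractor: crude tag stripping, for illustration only.
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return html.unescape(no_tags)


# Assumed mapping from content type to extractor.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "text/plain": extract_plain,
    "text/html": extract_html,
}


def clean(text: str) -> str:
    # Assumed cleaning rule: collapse whitespace runs and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(filings: Iterable[Tuple[str, str]]) -> List[str]:
    """Extract, clean, and exact-deduplicate (content_type, raw) pairs."""
    seen = set()
    corpus = []
    for content_type, raw in filings:
        extractor = EXTRACTORS.get(content_type)
        if extractor is None:
            continue  # no extractor registered for this content type
        text = clean(extractor(raw))
        if not text:
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        corpus.append(text)
    return corpus


# In-memory stand-in for downloaded filings.
sample = [
    ("text/html", "<p>Revenue &amp; income rose.</p>"),
    ("text/plain", "Revenue & income   rose."),
    ("text/plain", "Risk factors are unchanged."),
    ("application/pdf", "%PDF-..."),
]
corpus = build_corpus(sample)
print(corpus)
```

In this sample, the HTML and plain-text versions of the first sentence collapse to the same cleaned string, so only one copy survives, and the entry with an unregistered content type is skipped. A corpus at the scale described in the abstract would likely need near-duplicate detection rather than exact hashing; that choice is outside what this sketch covers.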