Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization

Arvid E. Gollwitzer, Paridhi Latawa, David de Gruijl, Deepak A. Subramanian, Adrián Noriega de la Colina

TL;DR

This work presents QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction and makes three key contributions: a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization.

Abstract

Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements in genomics (a 6.7-percentage-point F1 gain in variant calling over BPE) and finance (a 30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base-pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.
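
Contribution (iii) relies on the Gumbel-Softmax relaxation. The abstract does not say how the relaxation is parameterized, so the following is only a minimal NumPy sketch of the generic technique: Gumbel noise is added to a vector of logits and a temperature-scaled softmax turns the result into a soft one-hot sample. The function name `gumbel_softmax`, the temperature `tau`, and the example logits are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Return one relaxed one-hot sample over the entries of `logits`.

    Generic Gumbel-Softmax: softmax((logits + g) / tau) with g ~ Gumbel(0, 1).
    `tau` (temperature) is an assumed name; lower values give sharper samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample Gumbel(0, 1) noise by inverse transform: g = -log(-log(u)).
    u = rng.uniform(size=np.shape(logits))
    u = np.clip(u, 1e-12, 1.0 - 1e-12)  # keep both logarithms finite
    g = -np.log(-np.log(u))
    z = (np.asarray(logits, dtype=float) + g) / tau
    z -= z.max()  # subtract the maximum for a numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits over three candidate settings of a discrete parameter.
sample = gumbel_softmax([1.0, 2.0, 3.0], tau=0.5, rng=np.random.default_rng(0))
print(sample, sample.sum())
```

NumPy computes no gradients; in an autodiff framework the same expression is differentiable with respect to the logits, which is what makes a discrete choice usable in end-to-end optimization.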

Paper Structure (92 sections, 20 theorems, 41 equations, 21 tables, 9 algorithms)


Key Result

Theorem 3.2

The bilevel optimization problem in Eq. (eq:bilevel_objective_main) is NP-hard in general [dempe2020bilevel]; indeed, polynomial bilevel programming is $\Sigma_2^p$-hard [cen2023global], placing it one level above NP in the polynomial hierarchy. The worst case requires $O(|\Sigma|^K \cdot K! \cdot N \cdot \ldots)$
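
For orientation, a generic bilevel program has the shape below. The referenced equation is not reproduced on this page, so the symbols $x$, $y$, $F$, and $f$ are placeholders rather than the paper's notation; in a tokenization setting one would plausibly read $x$ as the vocabulary and $y$ as the downstream model's parameters.

$$
\min_{x \in X} \; F\bigl(x, y^*(x)\bigr)
\quad \text{s.t.} \quad
y^*(x) \in \arg\min_{y \in Y} f(x, y)
$$

The outer problem can only be evaluated by first solving the inner one, which is the source of the hardness the theorem refers to.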

Theorems & Definitions (42)

  • Definition 3.1: Bilevel Tokenization Problem
  • Theorem 3.2: Computational Complexity
  • Theorem 3.3: Quality-Aware Merge Score
  • Definition 4.1: Tokenization MDP
  • Proposition 3.1: Boundedness and Continuity of Quality Functions
  • proof
  • Lemma 3.2: First-Order Approximation
  • proof
  • Theorem 3.3: Quality-Aware Merge Score (Principled Heuristic)
  • proof
  • ...and 32 more
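
Theorem 3.3 names a quality-aware merge score, but its formula does not appear on this page. As a rough illustration of the general idea only, the sketch below weights each adjacent-pair occurrence in a BPE-style count by a per-symbol reliability value. The aggregation (minimum of the two symbol qualities), the exponent `alpha`, and the function names are all hypothetical choices made for this example, not the paper's definition.

```python
from collections import defaultdict

def pair_scores(sequences, qualities, alpha=1.0):
    """Score adjacent symbol pairs by quality-weighted frequency.

    sequences: list of symbol lists.
    qualities: matching lists of floats in [0, 1], one value per symbol.
    alpha: hypothetical knob; alpha=0 recovers plain BPE pair counts.
    """
    scores = defaultdict(float)
    for symbols, quals in zip(sequences, qualities):
        for i in range(len(symbols) - 1):
            # Assumed aggregation: a pair is only as reliable as its weaker symbol.
            q = min(quals[i], quals[i + 1])
            scores[(symbols[i], symbols[i + 1])] += q ** alpha
    return dict(scores)

def best_merge(sequences, qualities, alpha=1.0):
    """Return the pair with the highest quality-weighted score."""
    scores = pair_scores(sequences, qualities, alpha)
    return max(scores, key=scores.get)

# Toy reads in which every "A" carries a low reliability value.
seqs = [list("ACGTAC"), list("ACGG")]
quals = [[0.1, 0.9, 0.9, 0.9, 0.1, 0.9], [0.1, 0.9, 0.9, 0.9]]
print(best_merge(seqs, quals, alpha=0.0))  # ('A', 'C'): the most frequent pair
print(best_merge(seqs, quals, alpha=1.0))  # ('C', 'G'): frequent and reliable
```

In this toy input the most frequent pair is also the least reliable one, so weighting by quality changes which merge is chosen first.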