propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries

TL;DR

This work introduces propella-1, a family of small multilingual LLMs that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance.

Abstract

Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.
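To make the contrast with single-score filtering concrete, the following is a minimal sketch of how per-property JSON annotations could be combined into a composable filter. The field names (`content_quality`, `educational_value`, `commercial_bias`), their label values, and the sample records are illustrative assumptions, not the published propella-1 schema or real annotation output.

```python
import json

# Hypothetical annotation records, one JSON object per document.
# Field names and labels are assumptions made for illustration only.
raw_annotations = [
    '{"doc_id": "a", "content_quality": "excellent", "educational_value": "high", "commercial_bias": "none"}',
    '{"doc_id": "b", "content_quality": "excellent", "educational_value": "high", "commercial_bias": "strong"}',
    '{"doc_id": "c", "content_quality": "good", "educational_value": "low", "commercial_bias": "none"}',
    '{"doc_id": "d", "content_quality": "good", "educational_value": "high", "commercial_bias": "low"}',
]


def keep(annotation, required):
    """Return True if every required property carries one of its allowed labels."""
    return all(annotation.get(prop) in allowed for prop, allowed in required.items())


def filter_documents(lines, required):
    """Parse JSON annotation lines and return the ids of documents that pass."""
    annotations = (json.loads(line) for line in lines)
    return [a["doc_id"] for a in annotations if keep(a, required)]


# Each property is constrained independently, so criteria can be added,
# dropped, or loosened without retraining a classifier.
educational_filter = {
    "content_quality": {"good", "excellent"},
    "educational_value": {"high"},
    "commercial_bias": {"none", "low"},
}

selected = filter_documents(raw_annotations, educational_filter)
print(selected)
```

In this toy data, document `b` would likely score well on a single quality scale but is excluded here because of its commercial-bias label, which is the kind of distinction a scalar score cannot express.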

Paper Structure (97 sections, 1 equation, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Overall annotation agreement scores across all evaluated models. propella-1-4b exceeds Gemini-3-Flash and significantly larger open models.
  • Figure 2: Per-property annotation agreement scores across all evaluated models (12 properties). See the full per-property figure in the full-results appendix for the breakdown of all 17 properties.
  • Figure 3: Property distributions across four German-language pretraining sources. Sources differ dramatically across quality dimensions despite all being German text corpora. FinePDFs shows substantially higher rates of excellent quality, analytical reasoning, and high educational value.
  • Figure 4: Prevalence of quality issues in Nemotron-CC quality tiers. Even the "high" quality tier contains documents with issues on specific dimensions (commercial bias, information density, content integrity) that the single-score classifier does not capture.
  • Figure 5: Property distributions across six languages in FineWeb-2. The rightmost column shows the range (max − min) across languages. Commercial bias and information density exhibit the largest cross-language variation, while educational value and reasoning indicators are more uniform. These differences motivate language-specific filtering strategies.
  • ...and 2 more figures