LongWanjuan: Towards Systematic Measurement for Long Text Quality

Kai Lv, Xiaoran Liu, Qipeng Guo, Hang Yan, Conghui He, Xipeng Qiu, Dahua Lin

TL;DR

This work tackles the lack of systematic evaluation for long-text quality by introducing three linguistically grounded dimensions—coherence, cohesion, and complexity—and a suite of metrics that combine statistical signals with pre-trained-model guidance. It builds LongWanjuan, a bilingual long-text dataset with over $160\mathrm{B}$ tokens, and classifies data into holistic, aggregated, and chaotic types to enable balanced pre-training via a data-mixing recipe. The authors demonstrate that this approach yields significant improvements on long-context benchmarks like LongBench, achieving state-of-the-art performance at the 7B parameter scale. The resource and methodology offer a practical path to better long-text capabilities in foundation models and set the stage for broader multilingual expansion and future refinements.

Abstract

The quality of training data is crucial for enhancing the long-text capabilities of foundation models. Despite existing efforts to refine data quality through heuristic rules and evaluations based on data diversity and difficulty, there is a lack of systematic approaches specifically tailored for assessing long texts. Addressing this gap, our work systematically measures the quality of long texts by evaluating three fundamental linguistic dimensions: coherence, cohesion, and complexity. Drawing on these three dimensions, we introduce a suite of metrics designed to evaluate the quality of long texts, encompassing both statistical and pre-trained language model-based ones. Leveraging these metrics, we present LongWanjuan, a bilingual dataset of over 160B tokens specifically tailored to enhance the training of language models for long-text tasks. In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality. Furthermore, we devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks. The code and dataset are available at https://github.com/OpenLMLab/LongWanjuan.

Paper Structure (24 sections, 3 equations, 12 figures, 13 tables)

This paper contains 24 sections, 3 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: The three dimensions for measuring the quality of long texts: coherence, cohesion and complexity.
  • Figure 2: Pipeline for constructing the LongWanjuan dataset.
  • Figure 3: Distribution of texts with different characteristics on the $\text{Cohesion}_\text{conn}$ metric in the C4 domain.
  • Figure 4: Distribution of token and document counts across different domains. Each bar is divided from left to right into three parts: holistic, aggregated, and chaotic texts.
  • Figure 5: Distribution of token and document counts across different lengths. In LongWanjuan, over 99.9% of the data exceed the truncation length in pre-training.
  • ...and 7 more figures