Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest

Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang

TL;DR

This work introduces Cuckoo, an information extraction tagger trained via Next Tokens Extraction (NTE), a paradigm that converts duplicative spans in context into BIO-labeled extractive supervision by repurposing next-token prediction data from LLM training. By leveraging large-scale pre-training on C4 (≈100M NTE instances) and post-training on TuluV3 (≈2.6M NTE instances), Cuckoo achieves strong few-shot performance across basic IE, query-based IE, and instruction-following IE, outperforming traditional IE pre-training baselines and even matching or surpassing some LLM-based baselines. The approach demonstrates data efficiency, parameter efficiency, and transferability, with evidence of emergent in-context tagging and robust adaptation as LLM post-training data evolve. Overall, NTE enables scalable IE pre-training as a free rider on LLM resources, providing a practical path to scaling IE capabilities in tandem with advances in LLM data preparation and training pipelines.

Abstract

Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, from 102.6M extractive instances converted from LLMs' pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
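To make the reframing concrete, here is a minimal sketch of how a next-token prediction instance might be turned into BIO supervision when the upcoming tokens already occur in the context. This is an illustration under simplifying assumptions, not the authors' conversion pipeline: the function name `nte_bio_tags`, the whitespace tokenization, and the "longest matching prefix" rule are all hypothetical choices made for this example.

```python
from typing import List


def nte_bio_tags(context: List[str], continuation: List[str]) -> List[str]:
    """Return BIO tags over `context` (hypothetical helper for illustration).

    The longest prefix of `continuation` that appears contiguously in
    `context` is treated as the span to extract; every non-overlapping
    occurrence of it is tagged B/I and all other tokens are tagged O.
    """
    tags = ["O"] * len(context)

    # Find the longest prefix of the continuation that is present in the context.
    span_len = 0
    for candidate_len in range(len(continuation), 0, -1):
        candidate = continuation[:candidate_len]
        if any(
            context[i:i + candidate_len] == candidate
            for i in range(len(context) - candidate_len + 1)
        ):
            span_len = candidate_len
            break

    if span_len == 0:
        # The next tokens are not in the context: nothing to extract.
        return tags

    span = continuation[:span_len]
    i = 0
    while i <= len(context) - span_len:
        if context[i:i + span_len] == span:
            tags[i] = "B"
            for j in range(i + 1, i + span_len):
                tags[j] = "I"
            i += span_len  # skip past this occurrence to avoid overlaps
        else:
            i += 1
    return tags


if __name__ == "__main__":
    context = "Tom lives in New York . He loves".split()
    continuation = "New York very much".split()
    for token, tag in zip(context, nte_bio_tags(context, continuation)):
        print(f"{token}\t{tag}")
```

In this toy instance the continuation begins with "New York", which already appears in the context, so those two tokens receive `B` and `I` while everything else stays `O`. A continuation with no overlap yields an all-`O` sequence, which corresponds to a case where prediction cannot be recast as extraction.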

Paper Structure

This paper contains 47 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Cuckoo takes a free ride on LLM resources (e.g., C4 and TuluV3) by formalizing next token prediction for duplicative spans as extraction in the BIO paradigm. During inference, the prompts can be adjusted to different extractive tasks, making Cuckoo a versatile IE model.
  • Figure 2: Comparison of scale, cost, and diversity among different IE pre-training datasets. Data collection for Cuckoo is free because it converts LLM learning resources, which also forces the tagger to learn from diverse contexts. Cuckoo can also evolve with the data collected for LLM post-training.
  • Figure 3: The evolution of Cuckoo with LLM's post-training resources. Domain $[\mu-2\sigma, \mu+2\sigma]$ is annotated under each evaluation dimension.
  • Figure 4: In-context tagging ability emerges in Cuckoo but not in IE models pre-trained by other resources.
  • Figure 5: The data scaling trend of Cuckoo on the early $4.1$M C4 instances and the massive $100$M instances.
  • ...and 2 more figures