Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang
TL;DR
This work introduces Cuckoo, an information extraction (IE) tagger trained with Next Tokens Extraction (NTE), a paradigm that repurposes next-token prediction data from LLM training: when the tokens to be predicted already appear in the context, those spans are converted into BIO-labeled extractive supervision. With large-scale pre-training on C4 (≈100M NTE instances) and post-training on TuluV3 (≈2.6M NTE instances), Cuckoo achieves strong few-shot performance across basic IE, query-based IE, and instruction-following IE, outperforming traditional IE pre-training baselines and matching or surpassing some LLM-based baselines. The approach is data-efficient, parameter-efficient, and transferable, with evidence of emergent in-context tagging and robust adaptation as LLM post-training data evolve. Overall, NTE makes IE pre-training scalable as a free rider on LLM resources, so that IE capabilities can grow in tandem with advances in LLM data preparation and training pipelines.
Abstract
Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, from 102.6M extractive training instances converted from LLMs' pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
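To make the reframing concrete, the following is a minimal sketch of how a next-token prediction example might be turned into BIO-labeled extraction data. It is an illustration under simplifying assumptions, not the authors' conversion pipeline: the function name `nte_bio_tags` is hypothetical, tokens are compared by exact match on pre-split words, and the label set is reduced to plain B/I/O.

```python
from typing import List


def nte_bio_tags(context: List[str], target: List[str]) -> List[str]:
    """Tag each exact occurrence of `target` inside `context` with B/I labels.

    `context` holds the tokens seen so far and `target` holds the next tokens
    an LLM would be asked to predict. If the target span already appears in
    the context, it can be supervised as extraction instead of prediction.
    (Hypothetical helper; exact-match lookup is an assumption of this sketch.)
    """
    tags = ["O"] * len(context)
    n = len(target)
    if n == 0:
        return tags
    i = 0
    while i + n <= len(context):
        if context[i:i + n] == target:
            tags[i] = "B"                  # first token of the span
            for j in range(i + 1, i + n):
                tags[j] = "I"              # remaining tokens of the span
            i += n                         # continue after the match, no overlaps
        else:
            i += 1
    return tags


# The next tokens "New York" already occur in the context, so they become
# an extraction target over the context tokens.
context = "New York is big . I love".split()
print(list(zip(context, nte_bio_tags(context, ["New", "York"]))))
```

If the target span does not occur in the context, every label stays `O`; in this sketch such an example simply yields no extractive supervision.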
