Concept-aware Data Construction Improves In-context Learning of Language Models
Michal Štefánik, Marek Kadlčík, Petr Sojka
TL;DR
The paper asks why in-context learning (ICL) emerges in language models and challenges the notion that only scale or task diversity matters. It introduces Concept-aware Training (CoAT), a data-construction framework that pushes models to learn and apply latent reasoning concepts from demonstrations by requiring the demonstrations in each training prompt to be informative and non-trivial. Using a two-stage regime (synthetic TeaBReaC data with concept annotations, followed by fine-tuning on natural-language AdversarialQA), it shows that models can learn to use concepts on unseen tasks and become more robust to semantic priors, reaching practical performance on 70+ tasks with far less data than traditional multitask approaches. The findings point to a data-centric path to improving ICL, including transfer from synthetic to natural concepts and competitiveness with multitask learners, with implications for data-efficient, domain-adaptive ICL and potential applicability to low-resource languages.
Abstract
Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from a natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training data and creates functional in-context learners even in small-scale, synthetic settings. In this work, we practically explore this newly identified axis of ICL quality. We propose Concept-aware Training (CoAT), a framework for constructing training scenarios that make it beneficial for the LM to learn to utilize analogical reasoning concepts from demonstrations. We find that by using CoAT, pre-trained transformers can learn to better utilize new latent concepts from demonstrations and that this ability makes ICL more robust to the functional deficiencies of previous models. Finally, we show that concept-aware in-context learning is more effective for a majority of new tasks when compared to traditional instruction tuning, resulting in performance comparable to that of previous in-context learners trained on orders of magnitude more data.
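To make the data-construction idea concrete, the following is a minimal sketch of how a concept-aware training prompt could be assembled. It is not the authors' implementation. The `Sample` type, the `concept` field, the function names, and the prompt template are hypothetical, and both selection conditions are simplified readings: a demonstration counts as informative if it shares the target's annotated concept, and as non-trivial if its answer differs from the target's, so the answer cannot simply be copied from the prompt.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    question: str
    answer: str
    concept: str  # hypothetical annotation of the latent reasoning concept


def pick_demonstrations(target, pool, k, rng):
    """Select up to k demonstrations for `target`.

    Informativeness (assumed reading): same concept as the target.
    Non-triviality (simplified proxy): answer differs from the target's.
    """
    candidates = [
        s for s in pool
        if s != target
        and s.concept == target.concept
        and s.answer != target.answer
    ]
    rng.shuffle(candidates)  # avoid always picking the same demonstrations
    return candidates[:k]


def build_prompt(target, demos):
    """Format the demonstrations, then the unanswered target question."""
    blocks = [f"Question: {d.question}\nAnswer: {d.answer}" for d in demos]
    blocks.append(f"Question: {target.question}\nAnswer:")
    return "\n\n".join(blocks)


# Toy pool; the concept labels are illustrative only.
pool = [
    Sample("2 apples plus 3 apples?", "5", "addition"),
    Sample("4 pens plus 1 pen?", "5", "addition"),
    Sample("1 cat plus 1 cat?", "2", "addition"),
    Sample("Which is larger, 3 or 7?", "7", "comparison"),
]
target = pool[0]
demos = pick_demonstrations(target, pool, k=2, rng=random.Random(0))
print(build_prompt(target, demos))
```

In this toy pool only the "1 cat plus 1 cat?" sample qualifies as a demonstration: the second addition sample shares the target's answer, and the comparison sample exercises a different concept. A fuller non-triviality criterion might score candidates with a model rather than compare answers, but that goes beyond what this sketch assumes.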
