Concept-aware Data Construction Improves In-context Learning of Language Models

Michal Štefánik, Marek Kadlčík, Petr Sojka

TL;DR

The paper asks why in-context learning (ICL) emerges in language models and challenges the notion that only scale or task diversity matters. It introduces Concept-aware Training (CoAT), a data-construction framework that forces models to learn and apply latent reasoning concepts from demonstrations by enforcing informativeness and non-triviality in the training prompts. Using a two-stage regime (synthetic TeaBReaC data with concept annotations, followed by fine-tuning on natural-language AdversarialQA), it demonstrates that models can acquire robust concept utilization for unseen tasks and become more robust to semantic priors, achieving practical performance on 70+ tasks with far less data than traditional multitask approaches. The findings suggest a data-centric path to enhancing ICL, including transfer from synthetic to natural concepts and strong competitiveness with multitask learners, with implications for data-efficient, domain-adaptive in-context learning and potential applicability to low-resource languages.

Abstract

Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from a vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training data and creates functional in-context learners even in small-scale, synthetic settings. In this work, we practically explore this newly identified axis of ICL quality. We propose Concept-aware Training (CoAT), a framework for constructing training scenarios that make it beneficial for the LM to learn to utilize the analogical reasoning concepts from demonstrations. We find that by using CoAT, pre-trained transformers can learn to better utilize new latent concepts from demonstrations and that such ability makes ICL more robust to the functional deficiencies of the previous models. Finally, we show that concept-aware in-context learning is more effective for a majority of new tasks when compared to traditional instruction tuning, resulting in a performance comparable to the previous in-context learners using orders of magnitude more training data.

Paper Structure

This paper contains 35 sections, 1 equation, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Example of a training instruction constructed from the synthetic TeaBReaC dataset, where the demonstrations share an analogical reasoning chain. In Concept-aware Training (CoAT), we construct such examples to train in-context learners to utilise latent reasoning concepts whenever they are available in the demonstrations.
  • Figure 2: Demonstration selection in Concept-aware Training (CoAT): From all samples of the training dataset, we first (1) select those sharing a specific reasoning concept with the predicted sample $(x_\text{pred}, y_\text{pred})$. From this subset, we (2) iteratively pick the candidate demonstration(s) $c_i$ for which the trained model $\Theta$'s probability of generating the correct prediction $y_\text{pred}$, given $c_i$ among the demonstrations, is minimal.
  • Figure 3: In-context learning of new concepts: Relative change in models' performance when presented with demonstrations exhibiting a reasoning concept informative for the prediction. Evaluation with (left) synthetic TeaBReaC samples, and (right) diverse concepts of natural datasets (§\ref{sec:rq1}).
  • Figure 4: Models' reliance on semantic priors: Relative change in models' performance when we (left) replace labels with nonsensical tokens that have no correspondence to the semantics of the task, such as 'foo', 'bar', etc.; and (right) flip the original labels, so that, e.g., the 'negative' label corresponds to a positive-sentiment sample. CoAT models can in-context learn the input-output mapping similarly well with nonsensical labels and rely on the labels' semantics significantly less than previous in-context learners (in grey).
  • Figure 5: Effectiveness of Concept-aware Training on Natural-Instructions: Win rate of models utilising Concept-aware Training (CoAT; §\ref{sec:coat}) and traditional instruction tuning (Tk-Random; §\ref{sec:baselines}), evaluated on (top) all and (bottom) reasoning tasks of the Natural-Instructions collection. Values indicate the number of tasks where the referenced model reaches significantly higher accuracy than the other. For tasks marked as similar, the difference in the models' performance is not statistically significant.
  • ...and 4 more figures
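
The two-step demonstration selection described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` tuple layout, the function `select_demonstrations`, the parameter `k`, and the `toy_score` function are all hypothetical names introduced here. In particular, `toy_score` is only a stand-in for the trained model's probability of generating $y_\text{pred}$; a real setup would query the model $\Theta$ instead.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

# Hypothetical sample layout: (input text, target text, concept id).
Sample = Tuple[str, str, Hashable]


def select_demonstrations(
    dataset: Sequence[Sample],
    pred_sample: Sample,
    score_fn: Callable[[List[Sample], Sample], float],
    k: int = 2,
) -> List[Sample]:
    """Greedily pick up to k demonstrations that share the predicted sample's
    concept and for which score_fn (a proxy for the model's probability of
    the correct prediction) is minimal."""
    concept = pred_sample[2]
    # Step 1: keep only samples sharing the reasoning concept,
    # excluding the predicted sample itself.
    candidates = [s for s in dataset if s[2] == concept and s != pred_sample]
    chosen: List[Sample] = []
    # Step 2: iteratively add the candidate that minimises the score
    # when placed among the already chosen demonstrations.
    while candidates and len(chosen) < k:
        best = min(candidates, key=lambda c: score_fn(chosen + [c], pred_sample))
        chosen.append(best)
        candidates.remove(best)
    return chosen


def toy_score(demos: List[Sample], pred: Sample) -> float:
    # Assumption for illustration only: the "model" finds the prediction
    # easier the more demonstrations already carry the same target.
    if not demos:
        return 0.0
    return sum(d[1] == pred[1] for d in demos) / len(demos)


dataset = [
    ("q1", "yes", "compare"),
    ("q2", "no", "compare"),
    ("q3", "yes", "count"),
    ("q4", "no", "compare"),
    ("q5", "yes", "compare"),
]
pred = ("q0", "yes", "compare")
demos = select_demonstrations(dataset, pred, toy_score, k=2)
```

With this toy scoring, the selection keeps only demonstrations of the `compare` concept and prefers those whose target differs from the predicted one, so the answer cannot simply be copied from the demonstrations. This mirrors the combination of informativeness and non-triviality mentioned in the TL;DR, under the assumptions stated above.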