Language Models can Exploit Cross-Task In-context Learning for Data-Scarce Novel Tasks

Anwoy Chatterjee; Eshaan Tanwar; Subhabrata Dutta; Tanmoy Chakraborty

TL;DR

This work investigates how large language models can solve novel, data-scarce tasks by transferring contextual signals from a library of predefined tasks through cross-task prompting. It demonstrates substantial zero-shot improvements across multiple model sizes, with semantic similarity-based source demonstrations and task definitions guiding the transfer. To address label-space misalignment and data scarcity, the authors propose pseudo-label generation via cross-task prompting, which yields performance close to gold-labelled in-task prompts for several models. Interpretability analyses reveal that cross-task gains are tied to activation similarities between source and target task representations, offering initial mechanistic insights into how information flows across tasks within Transformer-based architectures. Overall, the study provides a practical, gradient-free pathway to extend LLM applicability to new domains while outlining limitations and directions for deeper theoretical understanding.

Abstract

Large Language Models (LLMs) have transformed NLP with their remarkable In-context Learning (ICL) capabilities. Automated assistants based on LLMs are gaining popularity; however, adapting them to novel tasks is still challenging. While colossal models excel in zero-shot performance, their computational demands limit widespread use, and smaller language models struggle without context. This paper investigates whether LLMs can generalize from labeled examples of predefined tasks to novel tasks. Drawing inspiration from biological neurons and the mechanistic interpretation of the Transformer architecture, we explore the potential for information sharing across tasks. We design a cross-task prompting setup with three LLMs and show that LLMs achieve significant performance improvements despite no examples from the target task in the context. Cross-task prompting leads to a remarkable performance boost of 107% for LLaMA-2 7B, 18.6% for LLaMA-2 13B, and 3.2% for GPT 3.5 on average over zero-shot prompting, and performs comparably to standard in-context learning. The effectiveness of generating pseudo-labels for in-task examples is demonstrated, and our analyses reveal a strong correlation between the effect of cross-task examples and model activation similarities in source and target input tokens. This paper offers a first-of-its-kind exploration of LLMs' ability to solve novel tasks based on contextual signals from different task examples.
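
The cross-task prompting setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template, the function names (`bow_vector`, `most_similar_demo`, `build_cross_task_prompt`) and the toy examples are all hypothetical, and a bag-of-words cosine similarity stands in for whatever semantic similarity measure the paper actually uses. The sketch only shows the general shape: pick the source-task demonstration closest to the target input, then place it, together with both task definitions, ahead of the unanswered target query.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lower-cased bag-of-words counts; an assumed stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar_demo(target_input, source_pool):
    """Return the (input, label) pair from the source task closest to the target input."""
    target_vec = bow_vector(target_input)
    return max(source_pool, key=lambda demo: cosine(target_vec, bow_vector(demo[0])))


def build_cross_task_prompt(source_def, target_def, target_input, source_pool):
    """Hypothetical template: source definition and demonstration, then target definition and query."""
    demo_input, demo_label = most_similar_demo(target_input, source_pool)
    return (
        f"Definition: {source_def}\n"
        f"Input: {demo_input}\n"
        f"Output: {demo_label}\n\n"
        f"Definition: {target_def}\n"
        f"Input: {target_input}\n"
        f"Output:"
    )


# Toy data, invented for illustration only.
source_pool = [
    ("Is aspirin used to reduce fever?", "yes"),
    ("Is the Pacific the smallest ocean?", "no"),
]
target_input = "Which drug is used to reduce fever? A) Aspirin B) Insulin C) Heparin D) Digoxin"
prompt = build_cross_task_prompt(
    source_def="Answer the question with yes or no.",
    target_def="Choose the correct option (A, B, C or D) for the medical question.",
    target_input=target_input,
    source_pool=source_pool,
)
print(prompt)
```

The resulting string would be passed to an LLM, which completes the final `Output:` line. Note that the prompt contains no labelled example from the target task, which is the defining property of the cross-task setting.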

Paper Structure (18 sections, 3 equations, 5 figures, 20 tables)

Introduction
Prompting techniques
Results and Analysis
Pseudo-label generation using cross-task prompting
Interpretability analysis
Error analysis
Related work
Conclusion
Dataset details
Source datasets
Target datasets
Semantic context creation
Cross-task prompting in instruction-tuned models
Force decoding
Random labeling
...and 3 more sections

Figures (5)

Figure 1: In this working example, we aim to solve a question from MedMCQA using demonstrations from BoolQ. To do so, we sample a semantically similar demonstration from the source task and then use this demonstration along with task descriptors of the source and target tasks to generate a cross-task prompt that is fed to an LLM.
Figure 2: Variation of accuracy with the number of cross-task demonstrations in the context C for an 8-shot prompt, where C consists of a mixture of cross-task and in-task examples, using (a) LLaMA-2 7B, (b) LLaMA-2 13B and (c) GPT3.5. Except for Financial-Phrasebank, performance remains stable as the number of cross-task examples increases.
Figure 3: Variation of accuracy with the number of (a) source demonstrations sampled from $D_{s}$ (performance largely stagnates at $k=1$) and (b) pseudo-labelled examples sampled from $D_{pl}$ (performance appears to increase with $k$). All experiments use the LLaMA-2 7B model.
Figure 4: Heat map of cosine similarities between the mean activations of the task definitions for source-target pairs, extracted from the final layer of (a) LLaMA-2 7B and (b) LLaMA-2 13B. The values in the cells show the absolute performance improvement over zero-shot prompting. For a target task, the source-target pair with the highest cosine similarity shows the most improvement in 80% of cases. (ARC-C: ARC-Challenge, F-Phr: Financial-Phrasebank, MMCQ: MedMCQA, S-QA: Social-i-QA, AGN: AG-news, ARCE: ARC-Easy, Com-QA: Commonsense-QA, C-POS: Conll2003-POS, C-NER: Conll2003-NER).
Figure 5: Variation of (a) Spearman's correlation coefficient, (b) Pearson correlation coefficient and (c) Kendall’s tau across layers for LLaMA-2 7B, as reported in the paper's layer-wise correlation table.
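
The similarity measure behind the Figure 4 heat map, cosine similarity between mean activations of task definitions, can be sketched as follows. This is an illustrative sketch only: the activation values are invented, the task names `source_a` and `source_b` are hypothetical, and extracting real activations from a model's final layer is left out. Each task definition is assumed to be given as a list of per-token activation vectors.

```python
import math


def mean_activation(token_activations):
    """Average per-token activation vectors into a single vector for the task definition."""
    n = len(token_activations)
    dim = len(token_activations[0])
    return [sum(tok[i] for tok in token_activations) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_sources(target_acts, source_acts_by_task):
    """Rank candidate source tasks by similarity of their mean activation to the target's."""
    target_vec = mean_activation(target_acts)
    scores = {
        name: cosine(target_vec, mean_activation(acts))
        for name, acts in source_acts_by_task.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented 2-dimensional activations for two tokens per task definition.
target_acts = [[1.0, 0.0], [1.0, 2.0]]        # mean is [1.0, 1.0]
source_acts_by_task = {
    "source_a": [[2.0, 2.0], [4.0, 4.0]],     # mean is [3.0, 3.0], same direction as target
    "source_b": [[1.0, 0.0], [1.0, 0.0]],     # mean is [1.0, 0.0]
}
ranked = rank_sources(target_acts, source_acts_by_task)
print(ranked)
```

Under the observation reported in the Figure 4 caption, the top-ranked source task would be the one expected to yield the largest improvement over zero-shot prompting in most cases; the sketch itself makes no claim beyond computing the ranking.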
