Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models

Jiaao Chen, Xiaoman Pan, Dian Yu, Kaiqiang Song, Xiaoyang Wang, Dong Yu, Jianshu Chen

TL;DR

The paper tackles the limited compositional generalization of LLMs by introducing Skills-in-Context (SKiC) prompting, a one-stage in-context framework that grounds reasoning in a curated set of foundational skills within the prompt. SKiC demonstrates near-perfect systematic generalization across diverse tasks by coupling explicit skill grounding with compositional exemplars and allowing the model to leverage both in-context and internal pre-trained skills. The approach shows strong transfer to new tasks and enables improved instruction tuning via SKiC-structured data, suggesting practical pathways to enhance reasoning capabilities in real-world settings. Overall, SKiC provides a simple, robust mechanism to unlock compositionality in LLMs and offers clear directions for scaling skill sets and integrating external tools in future work.

Abstract

We investigate how to elicit compositional generalization capabilities in large language models (LLMs). Compositional generalization empowers LLMs to solve complex problems by combining foundational skills, a critical reasoning ability akin to human intelligence. However, even the most advanced LLMs currently struggle with this form of reasoning. We examine this problem within the framework of in-context learning and find that demonstrating both foundational skills and compositional examples grounded in these skills within the same prompt context is crucial. We refer to this prompt structure as skills-in-context (SKiC). With as few as two exemplars, this in-context learning structure enables LLMs to tackle more challenging problems requiring innovative skill combinations, achieving near-perfect systematic generalization across a broad range of tasks. Intriguingly, SKiC also unlocks the latent potential of LLMs, allowing them to more actively utilize pre-existing internal skills acquired during earlier pretraining stages to solve complex reasoning problems. The SKiC structure is robust across different skill constructions and exemplar choices and demonstrates strong transferability to new tasks. Finally, inspired by our in-context learning study, we show that fine-tuning LLMs with SKiC-style data can elicit zero-shot weak-to-strong generalization, enabling the models to solve much harder problems directly with standard prompting.
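
The prompt structure described in the abstract (foundational skills and compositional exemplars grounded in those skills, all in one prompt context) can be sketched as follows. This is a minimal illustration, not the authors' template: the `Skill` and `Exemplar` classes, the `build_skic_prompt` function, the skill names, and the exact wording of every prompt line are assumptions made for this example. It uses last letter concatenation, one of the tasks the paper evaluates.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str          # hypothetical skill identifier
    description: str   # how to perform the skill


@dataclass
class Exemplar:
    question: str
    solution: str      # reasoning steps that cite skills by name


def build_skic_prompt(skills, exemplars, problem):
    """Assemble one prompt: skills first, then composed exemplars, then the new problem."""
    skill_block = "\n".join(
        f"Skill <{s.name}>: {s.description}" for s in skills
    )
    exemplar_block = "\n\n".join(
        f"Question: {e.question}\nAnswer: {e.solution}" for e in exemplars
    )
    return f"{skill_block}\n\n{exemplar_block}\n\nQuestion: {problem}\nAnswer:"


# Illustrative skills and exemplars (assumed wording, not taken from the paper).
skills = [
    Skill("words_to_list", "Split the input words into a list."),
    Skill("last_letter", "Return the last letter of a single word."),
]
exemplars = [
    Exemplar(
        question="Take the last letters of 'apple, banana' and concatenate them.",
        solution=(
            "1. Using the Skill <words_to_list>, the words are ['apple', 'banana'].\n"
            "2. Using the Skill <last_letter>, 'apple' gives 'e' and 'banana' gives 'a'.\n"
            "3. Concatenating in order, the answer is 'ea'."
        ),
    ),
    Exemplar(
        question="Take the last letters of 'cat, dog' and concatenate them.",
        solution=(
            "1. Using the Skill <words_to_list>, the words are ['cat', 'dog'].\n"
            "2. Using the Skill <last_letter>, 'cat' gives 't' and 'dog' gives 'g'.\n"
            "3. Concatenating in order, the answer is 'tg'."
        ),
    ),
]

prompt = build_skic_prompt(
    skills,
    exemplars,
    "Take the last letters of 'think, machine, learning' and concatenate them.",
)
print(prompt)
```

The test problem has three words while the exemplars have two, which mirrors the harder-than-demonstrated setting the abstract describes; whether a given model actually generalizes from such a prompt is an empirical question the sketch does not answer.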

Paper Structure

This paper contains 49 sections, 39 figures, and 25 tables.

Figures (39)

  • Figure 1: Skills-in-Context Prompting. The prompt consists of three blocks: (i) the (basic) skills for solving a complex task, (ii) examples of how to compose the skills, and (iii) the problem to be solved. The prompt is fed into an LLM to generate the output; see Figure \ref{Tab:example_last_letter_skill} for an example of the output. Note that the compositional exemplars demonstrate how to explicitly ground the reasoning steps in the basic skills (highlighted in color).
  • Figure 2: An example of the generated solution on the MATH task using SKiC. Intriguingly, the two highlighted skills <Angle Bisector Theorem> and <Heron's Formula> are neither provided in the SKiC context (see Figure \ref{Tab:math_skill}) nor used in any of the given exemplars. The LLM draws on internal skills from its pre-trained knowledge to solve the problem, and it also generates the two highlighted skill names itself.
  • Figure 3: Accuracy on last letter concatenation, addition, multiplication, and dynamic programming. The gray area is in-distribution evaluation, where the test examples have the same level of complexity as the examples in the context; the white area is out-of-distribution evaluation, where the test set consists of increasingly harder problems.
  • Figure 4: Exact Match on CommaQA-E. "Comp. Gen" reports the results on the compositional questions.
  • Figure 5: The accuracy on GSM8K tasks.
  • ...and 34 more figures