Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability

Zhuoyan Xu, Zhenmei Shi, Yingyu Liang

TL;DR

The paper probes whether large language models can truly compose simple tasks into unseen composite tasks under in-context learning. It combines empirical studies across logical and linguistic composites with a theoretical analysis of a simplified linear self-attention model to identify when separable input parts enable compositional ability and how scaling affects it. The findings show robust compositional performance for separable composites that align with input-subspace separation, while sequential or overlapped-task composites resist improvements from scaling. The work contributes both a testing suite and a theoretical framework, shedding light on the role of input structure and model scale in emergent compositional reasoning, and provides public data and code for replication.

Abstract

Large language models (LLMs) have emerged as powerful tools for many AI problems and exhibit remarkable in-context learning (ICL) capabilities. Compositional ability, solving unseen complex tasks that combine two or more simple tasks, is an essential reasoning ability for Artificial General Intelligence. Despite the tremendous success of LLMs, how they approach composite tasks, especially those not encountered during the pretraining phase, remains an open and largely underexplored question. In this study, we delve into the ICL capabilities of LLMs on composite tasks, with only simple tasks as in-context examples. We develop a test suite of composite tasks including linguistic and logical challenges and perform empirical studies across different LLM families. We observe that models exhibit divergent behaviors: (1) for simpler composite tasks that apply distinct mapping mechanisms to different input segments, the models demonstrate decent compositional ability, and scaling up the model enhances this ability; (2) for more complex composite tasks involving multiple reasoning steps, where each step represents one task, models typically underperform, and scaling up generally provides no improvement. We offer theoretical analysis in a simplified setting, explaining that models exhibit compositional capability when the task handles different input parts separately. We believe our work sheds new light on the capabilities of LLMs in solving composite tasks regarding the nature of the tasks and model scale. Our dataset and code are available at https://github.com/OliverXUZY/LLM_Compose.

Paper Structure

This paper contains 27 sections, 10 theorems, 43 equations, 4 figures, and 9 tables.

Key Result

Theorem 1

Consider distinct tasks $k$ and $g$ with corresponding examples $\mathcal{S}_k, \mathcal{S}_g$. If the two tasks have confined support (Definition 2) and the assumption labeled \ref{assum:lambda} holds, then with high probability the model has the compositional ability defined in Definition 1. Moreover,

Figures (4)

  • Figure 1: Inconsistent performance in GPT-4. Consider two simple tasks: if a word is followed by an asterisk (*), capitalize the word; if two words are surrounded by parentheses, swap their positions. GPT-4 correctly solves the two simple tasks based on demonstrations (left). The composite task has test inputs containing both asterisks (*) and parentheses. The correct output is SPORTS PIE. However, GPT-4 fails to solve the composite task (right). The same failure was observed in Claude 3.
  • Figure 2: Exact match accuracy ($y$-axis) vs. model scale ($x$-axis, "b" stands for billion) for the Capitalization & Swap tasks (example in \ref{fig:dia}). Lines: capital, performance on the simple capitalization task; swap, performance on the simple swap task; composite, in-context examples come from the simple tasks while the test input comes from the composite task; composite incontext, in-context examples and test input all come from the composite task (example in \ref{tab:upper_swap_example}).
  • Figure 3: Word error rate (WER) vs. model scale on composite linguistic translation tasks. Dashed lines: simple tasks. Solid lines: composite tasks. Rows: (T1) Phrase Recombination with Longer Chain; (T2) Passive to Active and Object to Subject Transformation. Columns: different models. Lines: performance in different evaluation settings, i.e., the two simple tasks, the composite setting, and the composite in-context setting (examples are shown in \ref{app:subsec:Translation Tasks}).
  • Figure 4: Accuracy vs. model scale on composite logical rule tasks. Dashed lines: simple tasks. Solid lines: composite tasks. Rows: (A) + (C) Capitalization & Two Sum; (G) + (H) Modular & Two Sum Plus; (A) + (F) Capitalization & Plus One. Columns: different models. Lines: performance in different evaluation settings, i.e., the two simple tasks, the composite setting, and the composite in-context setting (examples for the last two are shown in \ref{tab:upper_swap_example}).
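
The Capitalization & Swap setup from Figures 1 and 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the `input:`/`output:` prompt template, and the exact token layout (for example `pie*` for a starred word and `(pie sports)` for a parenthesized pair) are assumptions chosen to match the caption's description. It shows the "composite" evaluation setting, where the demonstrations come from the simple tasks and only the query mixes both markers, scored by exact match.

```python
import re

def capitalize_starred(text: str) -> str:
    """Simple task 1: uppercase each word followed by '*', dropping the '*'."""
    return re.sub(r"(\w+)\*", lambda m: m.group(1).upper(), text)

def swap_parenthesized(text: str) -> str:
    """Simple task 2: swap two words inside parentheses, dropping the parentheses."""
    return re.sub(r"\(\s*(\w+)\s+(\w+)\s*\)", r"\2 \1", text)

def composite(text: str) -> str:
    """Composite task: apply capitalization first, then the swap."""
    return swap_parenthesized(capitalize_starred(text))

def build_prompt(demos, query: str) -> str:
    """Assumed prompt template: demonstrations followed by an unanswered query."""
    blocks = [f"input: {x}\noutput: {y}" for x, y in demos]
    blocks.append(f"input: {query}\noutput:")
    return "\n".join(blocks)

def exact_match(prediction: str, target: str) -> bool:
    """Exact match after trimming surrounding whitespace."""
    return prediction.strip() == target.strip()

# Demonstrations use only the simple tasks; the query is composite.
demos = [
    ("apple*", capitalize_starred("apple*")),
    ("(farm frog)", swap_parenthesized("(farm frog)")),
]
query = "(pie* sports*)"
prompt = build_prompt(demos, query)
target = composite(query)  # "SPORTS PIE"
```

A model's completion for `prompt` would then be scored with `exact_match(completion, target)`. The "composite incontext" setting differs only in that the demonstrations would be built with `composite` instead of the two simple-task functions.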

Theorems & Definitions (17)

  • Definition 1: Compositional Ability
  • Definition 2: Confined Support
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Lemma E.1: Lemma 5.3 in zhang2023trained
  • Theorem 2
  • Proof: Proof of \ref{propAcc}
  • Corollary 1
  • Proof: Proof of \ref{propOverlap}
  • ...and 7 more