
From Words to Worlds: Compositionality for Cognitive Architectures

Ruchira Dhar, Anders Søgaard

TL;DR

The paper investigates whether large language models exhibit true compositionality and whether such compositionality explains performance. By analyzing four model families across three tasks (ANTAILS, PLANE, COMPCOMB) and contrasting scaling with instruction tuning, it finds that model size generally enhances both performance and compositional behavior, whereas instruction tuning yields inconsistent or sometimes detrimental effects. It introduces COMPCOMB as a novel embedding-space test and demonstrates that representations at different layers capture compositional structure unevenly, with last-layer signals typically more faithful than early embeddings. The work highlights that scaling may partly account for performance gains through improved compositional strategies, but emphasizes that compositionality alone does not fully explain success, pointing to nuanced, task- and model-specific dynamics and the need for broader, interpretable analyses.

Abstract

Large language models (LLMs) are highly performant connectionist systems, but do they exhibit more compositionality? More importantly, is that part of why they perform so well? We present empirical analyses across four LLM families (12 models) and three task categories, including a novel task introduced below. Our findings reveal a nuanced relationship in how LLMs learn compositional strategies: while scaling enhances compositional abilities, instruction tuning often has the reverse effect. This disparity raises open issues regarding the development and improvement of large language models in alignment with human cognitive capacities.

Paper Structure

This paper contains 12 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Model accuracy trends for two setups (combined) with the ANTAILS dataset.
  • Figure 2: Heatmap for three model types and four model families on the ANTAILS dataset.
  • Figure 3: Model accuracy trends for PLANE dataset.
  • Figure 4: Average accuracies of models across 3 classes of adjectives.
  • Figure 5: Model accuracy trends for two setups with the COMPCOMB dataset.
  • ...and 4 more figures