From Words to Worlds: Compositionality for Cognitive Architectures
Ruchira Dhar, Anders Søgaard
TL;DR
The paper investigates whether large language models exhibit true compositionality and whether such compositionality explains performance. By analyzing four model families across three tasks (ANTAILS, PLANE, COMPCOMB) and contrasting scaling with instruction tuning, it finds that model size generally enhances both performance and compositional behavior, whereas instruction tuning yields inconsistent or sometimes detrimental effects. It introduces COMPCOMB as a novel embedding-space test and demonstrates that representations at different layers capture compositional structure unevenly, with last-layer signals typically more faithful than early embeddings. The work highlights that scaling may partly account for performance gains through improved compositional strategies, but emphasizes that compositionality alone does not fully explain success, pointing to nuanced, task- and model-specific dynamics and the need for broader, interpretable analyses.
Abstract
Large language models (LLMs) are very performant connectionist systems, but do they exhibit more compositionality? More importantly, is that part of why they perform so well? We present empirical analyses across four LLM families (12 models) and three task categories, including a novel task introduced below. Our findings reveal a nuanced relationship in how LLMs learn compositional strategies: while scaling enhances compositional abilities, instruction tuning often has the reverse effect. This disparity raises open issues regarding the development and improvement of large language models in alignment with human cognitive capacities.
