Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

Ruchira Dhar; Qiwei Peng; Anders Søgaard

TL;DR

While LLMs reliably develop compositional representations, these representations do not translate consistently into functional task success across model variants, highlighting the importance of contrastive evaluation for a more complete understanding of model capabilities.

Abstract

Compositionality is considered central to language abilities. How do large language models (LLMs), as performant language systems, fare on compositional tasks? We evaluate adjective-noun compositionality in LLMs using two complementary setups: a prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, these representations do not translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.

Paper Structure (23 sections, 13 equations, 4 figures, 2 tables)

Introduction
Experimental Setup
Task Choice
Substitutivity.
Systematicity.
Overgeneralization.
Functional Evaluation
Methodology.
Results.
Representational Evaluation
Methodology.
Results.
Conclusion
Tasks and Datasets
Model Details
...and 8 more sections

Figures (4)

Figure 1: The average performance across model categories (Base, Instruction Tuning, and Large model size) on three tasks. We report the weighted F1 score on AddOne and PLANE, and accuracy on CompComb.
Figure 2: Layer-wise results (weighted F1 score) on the AddOne dataset.
Figure 3: Layer-wise results (weighted F1 score) on the PLANE dataset.
Figure 4: Layer-wise results (accuracy) on the CompComb dataset.
