Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

Ruchira Dhar, Qiwei Peng, Anders Søgaard

TL;DR

While LLMs reliably develop compositional representations, these representations do not translate consistently into functional task success across model variants, which highlights the importance of contrastive evaluation for a more complete understanding of model capabilities.

Abstract

Compositionality is considered central to language abilities. How do large language models (LLMs), as performant language systems, fare on compositional tasks? We evaluate adjective-noun compositionality in LLMs using two complementary setups: a prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, these representations do not translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.

Paper Structure

This paper contains 23 sections, 13 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Average performance across model categories (Base, Instruction Tuning, and Large model size) on three tasks. We report the weighted F1 score on AddOne and PLANE, and accuracy on CompComb.
  • Figure 2: Layer-wise results (weighted F1 score) on the AddOne dataset.
  • Figure 3: Layer-wise results (weighted F1 score) on the PLANE dataset.
  • Figure 4: Layer-wise results (accuracy) on the CompComb dataset.
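
The captions above report two metrics, weighted F1 and accuracy. As a reading aid, here is a minimal sketch of how the two differ, using the standard support-weighted definition of F1. It is not the authors' evaluation code; the function names and the toy labels ("entail", "neutral") are hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support
    (the number of gold examples carrying that label)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels, not taken from AddOne, PLANE, or CompComb.
gold = ["entail", "entail", "entail", "neutral"]
pred = ["entail", "entail", "neutral", "neutral"]

print(round(weighted_f1(gold, pred), 4))
print(accuracy(gold, pred))
```

On imbalanced label sets the two numbers can diverge, because weighted F1 penalizes false positives and false negatives per class before averaging, whereas accuracy only counts exact matches.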