Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type

Seokwon Song, Taehyun Lee, Jaewoo Ahn, Jae Hyuk Sung, Gunhee Kim

TL;DR

The paper introduces CCPT, a benchmark of 12,315 annotated triplets for Conceptual Combination with Property Type, designed to evaluate LLMs on three tasks, spanning generation and classification, that cover component, emergent, and canceled properties. It defines automatic evaluation metrics for emergence and cancellation and validates them against human judgments. The results show that current LLMs struggle particularly with emergent properties and noun-phrase generation, while a spreading-activation-inspired method yields the best performance. The study also shows that GPT-4o’s property-type predictions lag behind human performance, and that LLMs can approximate human judgments when guided by structured prompts and concept-graph aids. CCPT thus provides a targeted, cognitively informed framework for probing how models combine concepts and properties, with practical implications for creativity support, non-literal language understanding, and knowledge-grounded generation.

Abstract

Conceptual combination is a cognitive process that merges basic concepts, enabling the creation of complex expressions. During this process, the properties of a combination (e.g., the whiteness of a peeled apple) can be inherited from the basic concepts, newly emerge, or be canceled. However, previous studies have evaluated a limited set of properties and have not examined the generative process. To address this gap, we introduce the Conceptual Combination with Property Type dataset (CCPT), which consists of 12.3K annotated triplets of noun phrases, properties, and property types. Using CCPT, we establish three types of tasks to thoroughly evaluate LLMs on conceptual combination. Our key findings are threefold: (1) Our automatic metric for grading property emergence and cancellation closely corresponds with human judgments. (2) LLMs, including OpenAI's o1, struggle to generate noun phrases that possess given emergent properties. (3) Our proposed method, inspired by a cognitive psychology model that explains how relationships between concepts are formed, improves performance in all generative tasks. The dataset and experimental code are available at https://github.com/seokwon99/CCPT.git.

Paper Structure

This paper contains 32 sections, 2 equations, 8 figures, 17 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Three types of properties derived from conceptual combination, illustrated with the example of "apple". Different concepts are formed by adding other concepts to "apple". The green properties are component properties of the basic concept "apple". The blue and orange properties are emergent and canceled properties, respectively.
  • Figure 2: Overview of our data collection pipeline for conceptual combination through automated and human-driven data annotation.
  • Figure 3: Distributions of Pointwise Mutual Information (PMI) on log-2 scale based on the Google Books N-gram Corpus.
  • Figure 4: Correlation between LLM-as-a-judge and human ratings of the relevance score, which assesses how strongly a concept $\mathcal{X}$ possesses a property $\mathcal{P}$. To avoid overlapping points, random jitter sampled from $\mathcal{N}(0, 0.05^2)$ is added to the LLM-as-a-judge and human ratings after fitting the regression.
  • Figure 5: Instructions provided for annotators of emergent property data candidates.
  • ...and 3 more figures
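
For readers unfamiliar with the quantity plotted in Figure 3, the following is a minimal sketch of Pointwise Mutual Information on a log-2 scale, using the standard definition PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The function name `pmi_log2`, the single shared `total` used to normalize all counts, and the toy counts are assumptions made for illustration; the page does not state how the paper derives its probabilities from the Google Books N-gram Corpus.

```python
import math


def pmi_log2(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Return PMI in bits: log2( p(x, y) / (p(x) * p(y)) ).

    Assumption: every probability is estimated as count / total with one
    shared normalizer. A real n-gram pipeline may normalize unigram and
    bigram counts separately.
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Toy counts (made up, not taken from the paper or the corpus).
# The pair co-occurs 4x more often than independence would predict.
associated = pmi_log2(count_xy=32, count_x=64, count_y=128, total=1024)

# The pair co-occurs exactly as often as independence would predict.
independent = pmi_log2(count_xy=8, count_x=64, count_y=128, total=1024)

print(associated, independent)
```

A positive value means the two words co-occur more often than chance would predict, zero means they behave as if independent, and a negative value means they co-occur less often than chance.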