Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning
Cole Gawin, Yidan Sun, Mayank Kejriwal
TL;DR
This work tackles the challenge of evaluating abstract common-sense reasoning in large language models by leveraging ConceptNet as a knowledge base and two prompting paradigms: instruct prompting with explicit relation names and few-shot prompting with unnamed relations. It introduces rigorous, replication-friendly datasets and metrics (NDCG for ranking and Cohen's $\kappa$ for relation identification) and analyzes model behavior across varying candidate-relation counts. Findings reveal that while LLMs exhibit some grasp of common-sense semantics, they struggle with single-relation predictions and show biases when options are abundant, though performance improves with constrained choices. The study suggests that careful prompt engineering with selective retrieval holds promise for elevating abstract reasoning in practical settings and outlines directions for validating across more models and retrieval-augmented frameworks.
Abstract
Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving reasoning tasks of moderate complexity, such as question-answering and mathematical problem-solving. However, their capabilities in tasks requiring deeper cognitive skills, such as common-sense understanding and abstract reasoning, remain under-explored. In this paper, we systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph. We propose two prompting approaches: instruct prompting, where models predict plausible semantic relationships based on provided definitions, and few-shot prompting, where models identify relations using examples as guidance. Our experiments with the gpt-4o-mini model show that, in instruct prompting, performance is consistent when the model ranks multiple relations but declines substantially when it is restricted to predicting only one relation. In few-shot prompting, the model's accuracy improves significantly when it selects from five relations rather than the full set, although it shows a notable bias toward certain relations. These results suggest that significant gaps remain, even in commercially used LLMs, between their abstract common-sense reasoning abilities and human-level understanding. However, the findings also highlight the promise of careful prompt engineering, based on selective retrieval, for obtaining better performance.
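The TL;DR names two evaluation metrics, NDCG for ranking and Cohen's $\kappa$ for relation identification. The following is a minimal sketch of their textbook definitions, not the authors' evaluation code. The function names, the linear gain with a $\log_2$ discount, and the example labels are illustrative assumptions; the summary above does not state how the paper assigns relevance scores.

```python
import math
from collections import Counter


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 rank discount (assumed variant)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(ranked_relevances):
    """NDCG of a ranking, given the relevance of each item in the order the model ranked it."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label sequences, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: gold vs. predicted ConceptNet-style relation labels for four pairs.
gold = ["IsA", "IsA", "UsedFor", "UsedFor"]
predicted = ["IsA", "UsedFor", "UsedFor", "UsedFor"]
kappa = cohens_kappa(gold, predicted)

# Hypothetical ranking: the model placed an irrelevant relation above a relevant one.
score = ndcg([0, 1])
```

In this toy case the two label sequences agree on three of four items, yet $\kappa$ is only 0.5 because half of that agreement is expected by chance. A chance-corrected statistic of this kind is useful when a model favors certain relations.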
