Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning

Cole Gawin, Yidan Sun, Mayank Kejriwal

TL;DR

This work tackles the challenge of evaluating abstract common-sense reasoning in large language models by leveraging ConceptNet as a knowledge base and two prompting paradigms: instruct prompting with explicit relation names and few-shot prompting with unnamed relations. It introduces rigorous, replication-friendly datasets and metrics (NDCG for ranking and Cohen's $\kappa$ for relation identification) and analyzes model behavior across varying candidate-relation counts. Findings reveal that while LLMs exhibit some grasp of common-sense semantics, they struggle with single-relation predictions and show biases when options are abundant, though performance improves with constrained choices. The study suggests that careful prompt engineering with selective retrieval holds promise for elevating abstract reasoning in practical settings and outlines directions for validating across more models and retrieval-augmented frameworks.
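As a rough illustration of the agreement metric named above, the following is a minimal sketch of Cohen's $\kappa$ between gold and predicted relation labels. The function name `cohens_kappa` and the toy labels are assumptions made for illustration; this is not the authors' evaluation code.

```python
from collections import Counter


def cohens_kappa(gold, predicted):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(gold)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    labels = gold_counts.keys() | pred_counts.keys()
    expected = sum(gold_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Degenerate case: both sequences use a single identical label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical data): ConceptNet-style relation labels.
gold = ["IsA", "IsA", "UsedFor", "UsedFor"]
predicted = ["IsA", "UsedFor", "UsedFor", "UsedFor"]
kappa = cohens_kappa(gold, predicted)
```

In this toy case the observed agreement is 0.75 and the chance agreement is 0.5, so $\kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5$.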

Abstract

Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving reasoning tasks of moderate complexity, such as question-answering and mathematical problem-solving. However, their capabilities in tasks requiring deeper cognitive skills, such as common-sense understanding and abstract reasoning, remain under-explored. In this paper, we systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph. We propose two prompting approaches: instruct prompting, where models predict plausible semantic relationships based on provided definitions, and few-shot prompting, where models identify relations using examples as guidance. Our experiments with the gpt-4o-mini model show that in instruct prompting, performance is consistent when the model ranks multiple relations but declines substantially when the model is restricted to predicting only one relation. In few-shot prompting, the model's accuracy improves significantly when selecting from five relations rather than the full set, although with notable bias toward certain relations. These results suggest that significant gaps remain in the abstract common-sense reasoning abilities of even commercially used LLMs, compared to human-level understanding. However, the findings also highlight the promise of careful prompt engineering, based on selective retrieval, for obtaining better performance.

Paper Structure

This paper contains 4 sections, 3 figures.

Figures (3)

  • Figure 1: Schematized representation of two prompting templates/approaches (instruction prompting and few-shot prompting) for evaluating an LLM on abstract common-sense reasoning. The actual prompts given to the model are instantiated using the ConceptNet knowledge graph.
  • Figure 2: Distribution of NDCG scores grouped by relations across different K values. Boxplots represent quartiles and ranges for average NDCG scores as the model ranks different numbers of relations (top 1, 3, 5, 10, and the full set; shown as NDCG@1, NDCG@3, NDCG@5, NDCG@10, and NDCG@Full). Individual gray points show the average NDCG score for each relation within each K category. Mean NDCG scores are represented by black dots inside each box.
  • Figure 3: Comparison between two few-shot prompting experimental conditions: when the model was provided with only five (lighter shade) versus full set (darker shade) of possible relations. Note that the lighter shaded portion shows the additional performance gain achieved when limiting choices to 5 relations, with the total height of each bar (dark + light portions combined) representing performance in the 5-relation setting. F1 scores for these are explicitly noted above each bar-set.
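
To make the NDCG@K values in Figure 2 concrete, the following is a minimal sketch of how such a score could be computed for one ranked list of relations. It assumes binary relevance (a relation either holds for the concept pair or does not); the paper may use a different gain scheme. The function name `ndcg_at_k` and the toy data are hypothetical.

```python
import math


def ndcg_at_k(ranked_relations, gold_relations, k):
    """NDCG@k with binary relevance: 1 if a ranked relation is in the gold set, else 0."""
    top_k = ranked_relations[:k]
    # Discounted cumulative gain: each hit at rank i (0-based) is discounted by log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, relation in enumerate(top_k)
        if relation in gold_relations
    )
    # Ideal DCG: all gold relations placed at the top, limited to k positions.
    ideal_hits = min(len(gold_relations), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (hypothetical data): the model's ranking and the gold relations.
ranking = ["IsA", "UsedFor", "AtLocation", "PartOf"]
gold = {"UsedFor", "AtLocation"}

scores = {k: ndcg_at_k(ranking, gold, k) for k in (1, 3, len(ranking))}
```

With this ranking the top-ranked relation is not in the gold set, so NDCG@1 is 0 while NDCG@3 is positive. Under these assumptions, that is the kind of gap between single-relation and multi-relation settings that the boxplots summarize.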