Rule or Story, Which is a Better Commonsense Expression for Talking with Large Language Models?

Ning Bian, Xianpei Han, Hongyu Lin, Yaojie Lu, Ben He, Le Sun

TL;DR

The paper investigates whether stories or rules are the better expression of commonsense when interacting with large language models. Across 28 commonsense QA datasets, stories generally yield higher generation confidence and commonsense accuracy when retrieving commonsense from LLMs. Stories are more effective for questions about daily events, while rules are more effective for scientific questions; combining both can yield further gains. The paper also introduces iterative self-supervised fine-tuning (self-SFT) to mitigate commonsense hallucination and semantic drifting in generated stories, which brings additional improvements on both seen and unseen datasets. These findings argue for choosing the expression that matches the type of commonsense involved and point to self-supervised adaptation as a way to improve LLM commonsense abilities.

Abstract

Building machines with commonsense has been a longstanding challenge in NLP due to the reporting bias of commonsense rules and the exposure bias of rule-based commonsense reasoning. In contrast, humans convey and pass down commonsense implicitly through stories. This paper investigates the inherent commonsense ability of large language models (LLMs) expressed through storytelling. We systematically investigate and compare stories and rules for retrieving and leveraging commonsense in LLMs. Experimental results on 28 commonsense QA datasets show that stories outperform rules as the expression for retrieving commonsense from LLMs, exhibiting higher generation confidence and commonsense accuracy. Moreover, stories are the more effective commonsense expression for answering questions regarding daily events, while rules are more effective for scientific questions. This aligns with the reporting bias of commonsense in text corpora. We further show that the correctness and relevance of commonsense stories can be further improved via iterative self-supervised fine-tuning. These findings emphasize the importance of using appropriate language to express, retrieve, and leverage commonsense for LLMs, highlighting a promising direction for better exploiting their commonsense abilities.

Paper Structure (38 sections, 2 equations, 11 figures, 12 tables)

Figures (11)

  • Figure 1: Comparison between rules and a story written by ChatGPT. The rules only provide useful knowledge until the 4$^{th}$ rule and also include an incorrect answer option, "classroom". The story presents a detailed scenario where an adult uses glue sticks in an office.
  • Figure 2: Comparison of perplexity reduction between generating stories and rules. Sample size $N=14,000$ for each setting.
  • Figure 3: Comparison of perplexity reduction in generating the correct answer with stories or rules as context. Sample size $N=14,000$ for each setting.
  • Figure 4: Comparison between the accuracy (%) with stories and with rules for Vicuna.
  • Figure 5: The average scores of stories generated by Vicuna of different error types. The dashed lines are the overall average scores among all questions. Error bars indicate 95% confidence intervals.
  • ...and 6 more figures
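
The captions for Figures 2 and 3 refer to "perplexity reduction", but this summary does not give the formula. As a minimal sketch, assuming perplexity is the exponential of the negative mean token log-probability and that the reduction is the plain difference between a baseline and a context-conditioned score, the quantity could be computed as below. The function names and the sample log-probabilities are hypothetical; the paper may define the reduction differently (for example, as a relative change).

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_reduction(
    baseline_logprobs: Sequence[float],
    conditioned_logprobs: Sequence[float],
) -> float:
    """Baseline perplexity minus conditioned perplexity (assumed definition).

    A positive value means the added context made the text more predictable.
    """
    return perplexity(baseline_logprobs) - perplexity(conditioned_logprobs)


# Hypothetical log-probabilities for the same answer tokens, scored twice:
# once without context and once with a generated story as context.
without_context = [math.log(0.25), math.log(0.25)]
with_story = [math.log(0.5), math.log(0.5)]

reduction = perplexity_reduction(without_context, with_story)
print(f"perplexity reduction: {reduction:.3f}")
```

In practice the log-probabilities would come from the language model being evaluated; only the arithmetic is shown here.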