uTeBC-NLP at SemEval-2024 Task 9: Can LLMs be Lateral Thinkers?

Pouya Sadeghi; Amirhossein Abaskohi; Yadollah Yaghoobzadeh

TL;DR

This work probes whether LLMs exhibit true lateral thinking by evaluating prompting strategies (CoT variants, task-description compression) and retrieval-augmented in-context learning on SemEval-2024 Task 9 BrainTeaser data. It compares GPT-3.5, GPT-4, and Zephyr-7B-$\beta$ across CoT, enhanced prompts, and dynamic RAG-based few-shot inference, including a thesis-style external-CoT approach. Key findings show compressed informative prompts and dynamic in-context learning improve lateral-thinking performance, with some models (notably larger and better-trained ones) exhibiting stronger capabilities; fine-tuning Zephyr on a lateral-thinking dataset yields transfer gains to SWAG and CommonsenseQA. The results underscore the value of innovative prompt design and targeted fine-tuning for enhancing out-of-the-box reasoning in LLMs, offering practical pathways to bolster commonsense and puzzle-solving tasks across datasets.

Abstract

Inspired by human cognition, Jiang et al. (2023c) create a benchmark for assessing LLMs' lateral thinking, that is, thinking outside the box. Building upon this benchmark, we investigate how different prompting methods enhance LLMs' performance on this task to reveal their inherent capacity for outside-the-box thinking. Through participating in the Sentence Puzzle sub-task of SemEval-2024 Task 9, we explore three prompt engineering methods: chain-of-thought (CoT) and direct prompting, prompts enhanced with informative task descriptions, and contextualized prompts built with a retrieval-augmented generation (RAG) pipeline. Our experiments involve three LLMs: GPT-3.5, GPT-4, and Zephyr-7B-beta. We generate a dataset of thinking paths between riddles and options using GPT-4, validated by humans for quality. Findings indicate that compressed informative prompts enhance performance and that dynamic in-context learning improves it significantly. Furthermore, fine-tuning Zephyr on our dataset enhances performance across other commonsense datasets, underscoring the value of innovative thinking.

Paper Structure (29 sections, 5 figures, 6 tables)

Introduction
Background
Chain of Thoughts Prompting.
Enhanced Prompting Strategies.
In-context Learning.
Methodology
Dataset
BrainTeaser.
Additional Datasets.
Task Informative Context
Thinking Strategy
In-context Learning
Ordinary RAG.
Ranked RAG.
RAG Fusion.
...and 14 more sections

Figures (5)

Figure 1: A sample from the sentence puzzle sub-task with an explanation of how this puzzle defies default commonsense.
Figure 2: An illustration of our RAG Fusion setup. Using an LLM, we generate four variations of the original sample to identify similar ones in the dataset, then rank them to find the closest matches. See the appendix for more details and the prompts used.
Figure 3: An overview of our approaches in solving the BrainTeaser riddles. In this setup, we have a direct prompt that asks the model to find the appropriate answer. To provide more information to the model, we can offer some task explanation, with the compressed version depicted in this figure. Finally, we utilize our RAG setup to provide the model with in-context examples. In some experiments, we also include the theses for each question-option pair in the prompt, serving as an unbiased link between the question and the option.
Figure 4: Different prompting approaches and how they affect the model's performance. The GPT-3.5 baseline is the one reported by Jiang et al. (2023c).
Figure D.1: RAG Fusion. The four variants used are: (I) the original riddle, (II) a context reconstruction obtained from semantically reconstructed samples originating from the original riddle, (III) a context reconstruction derived from the original riddle, and (IV) a context reconstructed from step 3. We retrieve similar samples for each variant and then feed the retrieved documents to a ranker that filters them based on similarity and usefulness.
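
The retrieve-per-variant-then-rank flow described in the captions of Figure 2 and Figure D.1 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the variants are passed in as plain strings (in the paper they come from an LLM), similarity is a toy bag-of-words cosine rather than a learned embedding, and the final ranking uses reciprocal rank fusion as a stand-in for the paper's ranker, whose details are not given in this summary. The names `bow`, `cosine`, `retrieve`, and `rag_fusion` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bow(text):
    """Lowercased bag-of-words counts (toy stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k):
    """Return indices of the k corpus items most similar to the query."""
    q = bow(query)
    order = sorted(
        range(len(corpus)),
        key=lambda i: cosine(q, bow(corpus[i])),
        reverse=True,
    )
    return order[:k]


def rag_fusion(variants, corpus, k=3, rrf_k=60):
    """Retrieve for each variant, then fuse the rankings.

    Assumption: reciprocal rank fusion replaces the paper's ranker.
    """
    fused = defaultdict(float)
    for variant in variants:
        for rank, idx in enumerate(retrieve(variant, corpus, k)):
            fused[idx] += 1.0 / (rrf_k + rank + 1)
    ranked = sorted(fused, key=lambda i: fused[i], reverse=True)
    return [corpus[i] for i in ranked[:k]]


if __name__ == "__main__":
    corpus = [
        "A man shaves several times a day yet still has a beard",
        "What has keys but cannot open locks",
        "The capital of France is Paris",
    ]
    variants = [
        "Who shaves many times a day but keeps a beard",
        "A man shaves daily yet has a beard",
    ]
    for example in rag_fusion(variants, corpus, k=2):
        print(example)
```

In the setup the captions describe, `variants` would hold the four LLM-generated variants of the riddle, and the returned samples would serve as in-context examples in the prompt.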
