uTeBC-NLP at SemEval-2024 Task 9: Can LLMs be Lateral Thinkers?
Pouya Sadeghi, Amirhossein Abaskohi, Yadollah Yaghoobzadeh
TL;DR
This work probes whether LLMs exhibit true lateral thinking by evaluating prompting strategies (CoT variants, task-description compression) and retrieval-augmented in-context learning on the SemEval-2024 Task 9 BrainTeaser data. It compares GPT-3.5, GPT-4, and Zephyr-7B-$\beta$ across CoT, enhanced prompts, and dynamic RAG-based few-shot inference, including a thesis-style external-CoT approach. Key findings show that compressed informative prompts and dynamic in-context learning improve lateral-thinking performance, with larger and better-trained models exhibiting stronger capabilities; fine-tuning Zephyr on a lateral-thinking dataset yields transfer gains on SWAG and CommonsenseQA. The results underscore the value of careful prompt design and targeted fine-tuning for enhancing outside-the-box reasoning in LLMs, and suggest practical ways to improve performance on commonsense and puzzle-solving tasks across datasets.
Abstract
Inspired by human cognition, Jiang et al. (2023c) create a benchmark for assessing LLMs' lateral thinking, i.e., thinking outside the box. Building upon this benchmark, we investigate how different prompting methods enhance LLMs' performance on this task and thereby reveal their inherent capacity for outside-the-box thinking. By participating in the Sentence Puzzle sub-task of SemEval-2024 Task 9, we explore several prompt engineering methods: chain-of-thought (CoT) and direct prompting, enhancing prompts with informative descriptions, and contextualizing prompts using a retrieval-augmented generation (RAG) pipeline. Our experiments involve three LLMs: GPT-3.5, GPT-4, and Zephyr-7B-beta. Using GPT-4, we generate a dataset of thinking paths between riddles and options, which humans validate for quality. Our findings indicate that compressed informative prompts improve performance and that dynamic in-context learning improves it significantly. Furthermore, fine-tuning Zephyr on our dataset improves performance on other commonsense datasets, underscoring the value of innovative thinking.
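To make the idea of dynamic in-context learning concrete, the following is a minimal Python sketch of how few-shot examples might be retrieved per query and placed in a prompt. It is an illustration under stated assumptions, not the pipeline used in the paper: the bag-of-words similarity stands in for whatever retriever a real RAG setup would use, and the names `EXAMPLE_POOL`, `retrieve_examples`, and `build_prompt`, the prompt layout, and the riddles themselves are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical pool of solved riddles (illustrative only, not task data).
EXAMPLE_POOL = [
    {"question": "A man shaves several times a day but still has a beard. How?",
     "answer": "He is a barber."},
    {"question": "What has keys but cannot open locks?",
     "answer": "A piano."},
    {"question": "What gets wetter the more it dries?",
     "answer": "A towel."},
]


def bag_of_words(text):
    """Lowercased token counts; a simple stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query, pool, k=2):
    """Return the k pool entries whose questions are most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(pool,
                    key=lambda ex: cosine(q, bag_of_words(ex["question"])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, options, shots):
    """Assemble a few-shot prompt: retrieved examples first, then the new puzzle."""
    parts = [f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in shots]
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    parts.append(f"Question: {query}\nOptions:\n{numbered}\nAnswer:")
    return "\n\n".join(parts)


query = "A barber shaves many times a day yet keeps his beard. Why?"
options = ["He shaves his customers.", "His beard grows very fast.", "None of the above."]
shots = retrieve_examples(query, EXAMPLE_POOL, k=1)
print(build_prompt(query, options, shots))
```

The point of the sketch is only that the in-context examples are selected per query rather than fixed in advance; in practice the retriever would more plausibly use dense embeddings, and the assembled prompt would be sent to the LLM being evaluated.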
