BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

Baktash Ansari; Mohammadmostafa Rostamkhani; Sauleh Eetemadi

TL;DR

The paper addresses the evaluation of lateral thinking in NLP via SemEval-2024 Task 9 BRAINTEASER, which comprises Sentence Puzzles and Word Puzzles. It combines fine-tuning of BERT-Base and RoBERTa-Large with zero-shot Chain-of-Thought (CoT) prompting across six LLMs and a ReConcile multi-agent consensus mechanism to improve decision quality. Fine-tuned RoBERTa-Large yields the strongest single-model results ($0.766$ on sentence puzzles, $0.645$ on word puzzles), Copilot leads in zero-shot CoT performance, and successive ReConcile rounds converge to a higher final consensus ($0.758$); the best reported sentence-puzzle accuracy is around $0.85$. The work suggests that integrating discriminative fine-tuning with iterative reasoning and cross-model consensus can handle creative, non-standard reasoning tasks, offering a practical pathway for robust reasoning benchmarks in NLP.

Abstract

This paper outlines our approach to SemEval 2024 Task 9, BRAINTEASER: A Novel Task Defying Common Sense. The task aims to evaluate the ability of language models to think creatively. The dataset comprises multi-choice questions that challenge models to think "outside of the box". We fine-tune 2 models, BERT and RoBERTa Large. Next, we employ a Chain of Thought (CoT) zero-shot prompting approach with 6 large language models, such as GPT-3.5, Mixtral, and Llama2. Finally, we utilize ReConcile, a technique that employs a "round table conference" approach with multiple agents for zero-shot learning, to generate consensus answers among 3 selected language models. Our best method achieves an overall accuracy of 85 percent on the sentence puzzles subtask.
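The "round table" consensus step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the agent interface, the confidence-weighted vote, the stopping rule, and the names `weighted_consensus` and `round_table` are all assumptions made for illustration. Each agent is modeled as a plain callable that would wrap a language model call in practice.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: a vote is (chosen option, self-reported confidence in [0, 1]).
Vote = Tuple[str, float]
# Hypothetical agent interface: (question, choices, previous round's votes) -> vote.
Agent = Callable[[str, Sequence[str], List[Vote]], Vote]


def weighted_consensus(votes: List[Vote]) -> str:
    """Return the option with the largest summed confidence (assumed voting rule)."""
    if not votes:
        raise ValueError("need at least one vote")
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    # On a tie, the option that was voted for first wins (dict insertion order).
    return max(totals, key=totals.get)


def round_table(
    agents: Sequence[Agent],
    question: str,
    choices: Sequence[str],
    max_rounds: int = 3,
) -> str:
    """Run an initial round, then discussion rounds until agreement or max_rounds."""
    # Initial round: no agent sees any other agent's answer.
    votes: List[Vote] = [agent(question, choices, []) for agent in agents]
    for _ in range(max_rounds):
        if len({answer for answer, _ in votes}) == 1:
            break  # all agents agree, so stop discussing
        # Discussion round: every agent sees the previous round's votes.
        votes = [agent(question, choices, votes) for agent in agents]
    return weighted_consensus(votes)
```

In a real setup, each agent would format the question, the answer options, and the other agents' previous answers into a prompt, then parse the model's reply into an option and a confidence value. Those steps are omitted here because the page does not specify them.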

Paper Structure (19 sections, 3 equations, 5 figures, 8 tables)

Introduction
Background
Related Works
Datasets
Evaluation Metrics
System overview
Preprocessing
Model Training
Chain of Thought Prompting
ReConcile Round Table
Experiments and Results
Experimental Setup
Results
Conclusion
Training logs
...and 4 more sections

Figures (5)

Figure 1: Chain Of Thought Prompting (GPT3.5)
Figure 2: An Illustration of RECONCILE for Initial Round
Figure 3: Overall Accuracy of Two Models Logged Every 100 Training Steps on Sentence Puzzles.
Figure 4: Overall Accuracy of Two Models Logged Every 100 Training Steps on Word Puzzles.
Figure 5: ReConcile Initial and Discussion Prompts
