AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles

Ioannis Panagiotopoulos, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou

TL;DR

The study benchmarks two brainteaser sub-tasks from SemEval-2024 by comparing encoder-based fine-tuning with instruction-tuned LLMs. By transforming problems into both multi-class and binary formats and applying commonsense pre-training plus LoRA/QLoRA tuning, the authors achieve competitive results and analyze reasoning failures. Key findings show substantial gains from in-domain pre-training on encoder models and that the Mistral-7b LLM frequently outperforms other models, including larger LLMs, particularly on word puzzles. The work provides practical guidance on model selection, data utilization, and hyperparameter choices for reasoning-oriented NLP tasks, along with insights into the nature of brainteaser reasoning and model explanations. Overall, the results underscore the value of targeted transfer learning and structured evaluation for advancing flexible, out-of-the-box reasoning in NLP systems.
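The two problem formats mentioned above can be illustrated with a minimal sketch. It assumes each puzzle is a question with a list of candidate answers and one correct index; the `Puzzle` class, the function names, the dictionary keys, and the sample puzzle are all hypothetical and are not taken from the authors' code or the task data.

```python
from dataclasses import dataclass


@dataclass
class Puzzle:
    """Hypothetical container for one multiple-choice brain teaser."""
    question: str
    choices: list[str]
    answer_index: int  # position of the correct choice in `choices`


def to_multiclass(puzzle: Puzzle) -> dict:
    """Multi-class view: one example per puzzle, label is the correct index."""
    return {
        "question": puzzle.question,
        "choices": list(puzzle.choices),
        "label": puzzle.answer_index,
    }


def to_binary(puzzle: Puzzle) -> list[dict]:
    """Binary view: one (question, candidate) pair per choice, label 1 if correct."""
    return [
        {
            "question": puzzle.question,
            "candidate": choice,
            "label": int(i == puzzle.answer_index),
        }
        for i, choice in enumerate(puzzle.choices)
    ]


if __name__ == "__main__":
    # Made-up puzzle, used only to show the shape of the two formats.
    example = Puzzle(
        question="What has keys but cannot open a lock?",
        choices=["A piano", "A door", "A safe", "None of the above"],
        answer_index=0,
    )
    print(to_multiclass(example))
    for pair in to_binary(example):
        print(pair)
```

In the binary view a puzzle with four choices yields four training examples, only one of which is positive, so class imbalance may need to be handled at training time.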

Abstract

In this paper, we outline our submission for the SemEval-2024 Task 9 competition: 'BRAINTEASER: A Novel Task Defying Common Sense'. We participate in both sub-tasks: Sub-task A (Sentence Puzzle) and Sub-task B (Word Puzzle). We evaluate a range of pre-trained transformer-based language models of different sizes through fine-tuning. We then analyze their scores and responses to help future researchers understand and use these models effectively. Our top-performing approaches secured competitive positions on the competition leaderboard across both sub-tasks. In the evaluation phase, our best submission attained an average accuracy of 81.7% on the Sentence Puzzle and 85.4% on the Word Puzzle, significantly outperforming the best neural baseline (ChatGPT) by more than 20% and 30%, respectively.

Paper Structure (27 sections, 7 tables)