Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning

Qiming Bao, Gael Gendron, Alex Yuxuan Peng, Wanjun Zhong, Neset Tan, Yang Chen, Michael Witbrock, Jiamou Liu

TL;DR

This work investigates how large language models generalise and remain robust in logical reasoning when task structures are perturbed. It introduces three augmented benchmarks—ReClor-plus, LogiQA-plus, and LogiQAv2-plus—with three perturbations to probe reasoning beyond memorised patterns, and evaluates both discriminative and generative LLMs. The findings show that standard models falter under perturbations, while instruction fine-tuning and logic-driven data augmentation substantially improve robustness; model size alone does not guarantee better generalisation. The results point to practical pathways for robustness through data-centric augmentation and prompting strategies, offering guidance for future evaluation and training of reasoning-focused LLMs.

Abstract

Large language models (LLMs), such as LLaMA, Alpaca, Vicuna, GPT-3.5 and GPT-4, have advanced the performance of AI systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness when performing logical reasoning have not been sufficiently assessed. To comprehensively evaluate this ability, we develop three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus" that extend standard logical reasoning datasets to evaluate the robustness of LLM reasoning. For each, we create three subsets: the first with randomly shuffled options, the second with the correct option replaced by "none of the other options is correct", and the third with a combination of shuffling and substitution. Experiments on these datasets show that these simple augmentations greatly hinder the models' performance. Despite their high performance on the original publicly available datasets, we find that all models perform poorly on these newly constructed datasets. We also demonstrate that introducing task variations into the training set can markedly improve model performance on both the original and our developed datasets. Finally, we show that applying logic-driven data augmentation for fine-tuning and prompting can enhance generalisation in both discriminative and generative models, offering a path to improving their robustness for tasks involving logical reasoning. Source code and data are made publicly available at https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.
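The three task-structure variations described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released code: the `MCQAItem` record, the function names, and the exact wording and casing of the replacement string are assumptions made for illustration, as is the choice to shuffle before substituting in the combined variant.

```python
import random
from dataclasses import dataclass
from typing import List

# Assumed wording; the abstract quotes "none of the other options is correct".
NONE_OPTION = "None of the other options is correct."


@dataclass(frozen=True)
class MCQAItem:
    """Hypothetical record for one multiple-choice question."""
    context: str
    question: str
    options: List[str]
    label: int  # index of the correct option


def shuffle_options(item: MCQAItem, rng: random.Random) -> MCQAItem:
    """Variation 1: randomly reorder the options and track the new label."""
    order = list(range(len(item.options)))
    rng.shuffle(order)
    options = [item.options[i] for i in order]
    # The correct option now sits wherever its old index landed in `order`.
    return MCQAItem(item.context, item.question, options, order.index(item.label))


def replace_answer(item: MCQAItem) -> MCQAItem:
    """Variation 2: overwrite the correct option with the 'none' statement."""
    options = list(item.options)  # copy so the original item is untouched
    options[item.label] = NONE_OPTION
    return MCQAItem(item.context, item.question, options, item.label)


def shuffle_and_replace(item: MCQAItem, rng: random.Random) -> MCQAItem:
    """Variation 3: combine both (shuffle first, then substitute; order assumed)."""
    return replace_answer(shuffle_options(item, rng))
```

A seeded `random.Random` instance keeps the shuffled subsets reproducible. In every variation the label still points at the single correct choice, so accuracy can be computed exactly as on the original datasets.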

Paper Structure (18 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Instruction fine-tuning provides the model with a task description before the input. Each example comprises an Instruction, an Input, and a Question, and the model then produces the expected output. The correct answer is highlighted in blue with a checkmark. Each question has four choices, exactly one of which is correct.
  • Figure 2: We apply instruction fine-tuning (IFT) and instruction prompting (IPT) to generative (blue square) and discriminative (cyan square) language models, testing them on multiple-choice question answering (MCQA) datasets (green circles). These datasets are modified by shuffling options and replacing answers.
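
The prompt layout that the Figure 1 caption describes (an Instruction, an Input, and a Question with four choices) could be assembled roughly as follows. The field labels, the A to D lettering, the trailing "Answer:" cue, and the `build_prompt` helper are all assumptions for illustration; the paper's actual template may differ.

```python
from typing import List

OPTION_LETTERS = "ABCD"  # assumed lettering for the four choices


def build_prompt(instruction: str, context: str, question: str,
                 options: List[str]) -> str:
    """Assemble a hypothetical instruction-style prompt for one question."""
    if len(options) != len(OPTION_LETTERS):
        raise ValueError("expected exactly four options")
    choices = [f"{letter}. {text}"
               for letter, text in zip(OPTION_LETTERS, options)]
    return (
        f"Instruction: {instruction}\n"
        f"Input: {context}\n"
        f"Question: {question}\n"
        + "\n".join(choices)
        + "\nAnswer:"
    )
```

Because the options are passed in as a list, a shuffled or answer-replaced version of a question can be formatted with the same helper without further changes.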