Lost in the Logic: An Evaluation of Large Language Models' Reasoning Capabilities on LSAT Logic Games

Saumya Malik

TL;DR

This thesis constructs a dataset of LSAT logic games and their associated metadata, and extensively evaluates LLMs' performance in a Chain-of-Thought prompting setting, highlighting the capacity of LLMs to revise their logical errors, despite initially weak performance.

Abstract

In this thesis, I evaluate the performance of Large Language Models (LLMs) on the Law School Admission Test (LSAT), specifically the Logic Games section of the test. I focus on this section because it presents a complex logical reasoning task and is thus a valuable source of data for evaluating how well modern, increasingly capable LLMs handle hard logical reasoning tasks. I construct a dataset of LSAT logic games and their associated metadata, and extensively evaluate LLMs' performance in a Chain-of-Thought prompting setting. Given the weak performance in this setting, I explore other prompting frameworks on a smaller subset of the dataset, adapting ideas from Reflexion to this task. This results in a substantially improved accuracy of 70 percent for GPT-4 and 46 percent for GPT-3.5 on this data subset, highlighting the capacity of LLMs to revise their logical errors despite initially weak performance. Finally, I analyze the types of logic games on which models perform better or worse, as well as the types of logical errors I observe through human annotation, providing detailed insights into the logical reasoning capabilities of LLMs.

Paper Structure (45 sections, 8 figures, 6 tables)

Introduction
Background and Related Work
LSAT Logic Games
Prompting
Chain-of-Thought Prompting
Related Work
Logical Reasoning Datasets
LSAT and Language Models
Dataset Construction and Implementation
Dataset Construction
Logic Game Difficulty and Problem Difficulty
Logic Game Type
In-and-Out Games
Sequence Games
Grouping Games
...and 30 more sections

Figures (8)

Figure 1: Sample Explanation, Adapted from Khan Academy
Figure 2: Walkthrough of a Successful GPT-4 Self-Reflection
Figure 3: Accuracy by Problem Difficulty
Figure 4: Accuracy by Game Type
Figure 5: Accuracy by Game Type on Difficulty 3 Problems
...and 3 more figures
