Chain of Logic: Rule-Based Reasoning with Large Language Models

Sergio Servantez; Joe Barrow; Kristian Hammond; Rajiv Jain

TL;DR

The paper evaluates large language models (LLMs) on rule-based, compositional legal reasoning and introduces Chain of Logic, a prompting framework that decomposes a rule into individual elements, reasons about each element separately, and recombines the sub-answers to resolve the rule's logical expression. Inspired by the IRAC framework, Chain of Logic produces interpretable, stepwise reasoning traces that make incorrect conclusions easier to debug. Across eight LegalBench tasks and multiple language models (including GPT-3.5, GPT-4, and open-source variants), the method consistently outperforms chain of thought, self-ask, and other baselines in a one-shot setting where the single demonstration uses a different rule than the test task, which reduces the need for per-rule demonstrations. The approach holds promise for legal-domain AI applications, potentially enabling better reasoning, easier instruction tuning, and less reliance on large annotated datasets, with future extensions to multi-pass and retrieval-augmented strategies.

Abstract

Rule-based reasoning, a fundamental type of legal reasoning, enables us to draw conclusions by accurately applying a rule to a set of facts. We explore causal language models as rule-based reasoners, specifically with respect to compositional rules (rules consisting of multiple elements which form a complex logical expression). Reasoning about compositional rules is challenging because it requires multiple reasoning steps and attention to the logical relationships between elements. We introduce a new prompting method, Chain of Logic, which elicits rule-based reasoning through decomposition (solving elements as independent threads of logic) and recomposition (recombining these sub-answers to resolve the underlying logical expression). This method was inspired by the IRAC (Issue, Rule, Application, Conclusion) framework, a sequential reasoning approach used by lawyers. We evaluate chain of logic across eight rule-based reasoning tasks involving three distinct compositional rules from the LegalBench benchmark and demonstrate that it consistently outperforms other prompting methods, including chain of thought and self-ask, using open-source and commercial language models.
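
The decomposition and recomposition steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: in the paper both steps happen inside a single prompt, whereas here they are separated into plain functions to make the logic visible. The `Element` class, the `chain_of_logic` function, the three-element rule, and the canned answers are all hypothetical names and values chosen for illustration, and the lambda stands in for a language model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Element:
    """One rule element, phrased as a yes/no question about the facts."""
    name: str
    question: str


def chain_of_logic(
    elements: List[Element],
    answer_element: Callable[[Element], bool],
    expression: Callable[[Dict[str, bool]], bool],
) -> Tuple[Dict[str, bool], bool]:
    # Decomposition: each element is answered on its own,
    # independently of the other elements.
    sub_answers = {e.name: answer_element(e) for e in elements}
    # Recomposition: plug the sub-answers into the rule's logical expression.
    return sub_answers, expression(sub_answers)


# Hypothetical compositional rule, for illustration only:
# jurisdiction exists if domicile OR (contacts AND nexus).
ELEMENTS = [
    Element("domicile", "Is the defendant domiciled in the forum state?"),
    Element("contacts", "Does the defendant have sufficient contacts with the forum state?"),
    Element("nexus", "Does the claim arise out of those contacts?"),
]


def rule(answers: Dict[str, bool]) -> bool:
    return answers["domicile"] or (answers["contacts"] and answers["nexus"])


# Stand-in for a language model: canned answers for one made-up fact pattern.
canned = {"domicile": False, "contacts": True, "nexus": True}
sub_answers, conclusion = chain_of_logic(ELEMENTS, lambda e: canned[e.name], rule)
print(sub_answers)
print(conclusion)
```

Keeping the per-element answers next to the final conclusion is what makes the trace inspectable: if the conclusion is wrong, one can check whether an individual element was misjudged or whether the logical expression was resolved incorrectly.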

Paper Structure (16 sections, 5 figures, 6 tables)

Introduction
Background
Chain of Logic
Experiments
Tasks
Baseline Methods
Language Models
Results and Discussion
Diversity Jurisdiction Series
Ablation Study
Related Work
Limitations and Future Work
Conclusion
Appendix
LegalBench Task Examples
...and 1 more section

Figures (5)

Figure 1: Example showing compositional structure of rule for Personal Jurisdiction task from LegalBench. Color coding is used to identify rule elements and illustrate how these elements form a complex logical expression. Reasoning about compositional rules requires not only correctly applying each element to a fact pattern, but also resolving the logical expression. If the logical expression evaluates to true, it triggers a consequence (personal jurisdiction exists).
Figure 2: Comparing one-shot examples demonstrating chain of thought, self-ask and our chain of logic method on the Personal Jurisdiction task. Through a sequence of reasoning steps, chain of logic decomposes a rule into elements which are solved independently, before recomposing sub-answers to arrive at a final conclusion. See the Chain of Logic section for a detailed discussion on the chain of logic approach.
Figure 3: Comparing GPT-3.5 output for chain of thought, self-ask and our chain of logic method on the same Personal Jurisdiction task. Chain of logic prompting elicits large language models to reason about complex rules while also constructing an interpretable reasoning path. The prompts here are abridged, omitting a one-shot example for each method from the Diversity Jurisdiction task (see the Tasks section).
Figure 4: Accuracy (%) across all 6 Diversity Jurisdiction tasks using GPT-3.5. The fact patterns in these tasks are increasingly complex, from DJ1 (easiest) to DJ6 (hardest). Chain of logic particularly outperforms other prompting methods for tasks requiring arithmetic operations (DJ3, DJ5, DJ6). See the Diversity Jurisdiction Series section for a detailed discussion.
Figure 5: GPT-4 output demonstrating common errors in one-shot methods.
