Transformer-based Language Models for Reasoning in the Description Logic ALCQ

Angelos Poulis, Eleni Tsalapati, Manolis Koubarakis

TL;DR

The paper addresses the challenge of evaluating transformer-based models on entailment tasks over expressive Description Logic contexts, specifically $\mathcal{ALCQ}$ under open-world semantics. It introduces DELTA$_D$, a natural language dataset of 384K examples generated via probabilistic grammars that vary reasoning depth $\mathcal{D}$ and linguistic complexity $\mathcal{L}$, with DL axioms translated to natural language. A DeBERTa-based model trained on DELTA$_D$ ($\Delta_M$) achieves near-perfect accuracy across depths and complexities, while GPT-3.5/4 show strong few-shot performance but less robustness at higher depths; zero-shot tests reveal varying degrees of generalization, and symbolic variants reveal limits of semantics-driven learning. The work provides open-source data and demonstrates practical potential for scalable reasoning in real-world diagnostics and beyond, while outlining future work on expanding expressivity and broader model comparisons.
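
To make the idea of grammar-based generation concrete, the following is a minimal sketch of sampling $\mathcal{ALCQ}$ concept expressions from a weighted (probabilistic) grammar. It is not the authors' generator: the vocabulary, the rule weights, and the names `sample_concept`, `depth`, and `to_dl` are all hypothetical. The `max_depth` parameter bounds syntactic nesting only, which is a rough stand-in for linguistic complexity and is not the paper's reasoning depth $\mathcal{D}$. Translation to natural language is not shown.

```python
import random

# Hypothetical toy vocabulary; not taken from the DELTA_D grammars.
ATOMIC_CONCEPTS = ["Person", "Doctor", "Patient"]
ROLES = ["treats", "knows"]

# ALCQ constructors with illustrative (assumed) rule probabilities.
CONSTRUCTORS = ["atomic", "not", "and", "or",
                "exists", "forall", "at_least", "at_most"]
WEIGHTS = [0.30, 0.10, 0.15, 0.15, 0.10, 0.10, 0.05, 0.05]


def sample_concept(rng, max_depth):
    """Sample a concept as a nested tuple with at most max_depth nested constructors."""
    if max_depth == 0:
        return ("atomic", rng.choice(ATOMIC_CONCEPTS))
    kind = rng.choices(CONSTRUCTORS, weights=WEIGHTS, k=1)[0]
    if kind == "atomic":
        return ("atomic", rng.choice(ATOMIC_CONCEPTS))
    if kind == "not":
        return ("not", sample_concept(rng, max_depth - 1))
    if kind in ("and", "or"):
        left = sample_concept(rng, max_depth - 1)
        right = sample_concept(rng, max_depth - 1)
        return (kind, left, right)
    if kind in ("exists", "forall"):
        return (kind, rng.choice(ROLES), sample_concept(rng, max_depth - 1))
    # Qualified number restriction: (>= n r.C) or (<= n r.C).
    return (kind, rng.randint(1, 3), rng.choice(ROLES),
            sample_concept(rng, max_depth - 1))


def depth(concept):
    """Number of nested constructors; atomic concepts have depth 0."""
    if concept[0] == "atomic":
        return 0
    return 1 + max(depth(part) for part in concept[1:] if isinstance(part, tuple))


def to_dl(concept):
    """Render a sampled concept in standard description logic notation."""
    kind = concept[0]
    if kind == "atomic":
        return concept[1]
    if kind == "not":
        return "¬" + to_dl(concept[1])
    if kind in ("and", "or"):
        symbol = "⊓" if kind == "and" else "⊔"
        return f"({to_dl(concept[1])} {symbol} {to_dl(concept[2])})"
    if kind in ("exists", "forall"):
        symbol = "∃" if kind == "exists" else "∀"
        return f"{symbol}{concept[1]}.{to_dl(concept[2])}"
    symbol = "≥" if kind == "at_least" else "≤"
    return f"{symbol}{concept[1]} {concept[2]}.{to_dl(concept[3])}"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is repeatable
    for _ in range(3):
        print(to_dl(sample_concept(rng, max_depth=2)))
```

Raising `max_depth` or shifting weight away from `atomic` yields longer, more deeply nested expressions, which is one simple way a grammar can control the complexity of generated sentences.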

Abstract

Recent advancements in transformer-based language models have sparked research into their logical reasoning capabilities. Most of the benchmarks used to evaluate these models are simple: generated from short (fragments of) first-order logic sentences with only a few logical operators and quantifiers. We construct the natural language dataset, DELTA$_D$, using the expressive description logic language $\mathcal{ALCQ}$. DELTA$_D$ comprises 384K examples and increases in two dimensions: i) reasoning depth, and ii) linguistic complexity. In this way, we systematically investigate the logical reasoning capabilities of a supervised fine-tuned DeBERTa-based model and two large language models (GPT-3.5, GPT-4) with few-shot prompting. We show that the DeBERTa-based model fine-tuned on our dataset can master the entailment checking task. Moreover, the performance of GPTs can improve significantly even when a small number of samples is provided (9 shots). We open-source our code and datasets.

Paper Structure

This paper contains 31 sections, 2 figures, 10 tables.

Figures (2)

  • Figure 1: An example from DELTA$_D$, where the context contains three sentences of high linguistic complexity level and the true and false sentences are of reasoning depth $3$.
  • Figure 2: Data generation pipeline for examples with $n$-level context and answers of minimum inference depth $\leq m$.