Transformer-based Language Models for Reasoning in the Description Logic ALCQ

Angelos Poulis; Eleni Tsalapati; Manolis Koubarakis

TL;DR

The paper addresses the challenge of evaluating transformer-based models on entailment tasks over expressive Description Logic contexts, specifically $\mathcal{ALCQ}$ under open-world semantics. It introduces DELTA$_D$, a natural language dataset of 384K examples. The dataset is generated with probabilistic grammars that vary reasoning depth $\mathcal{D}$ and linguistic complexity $\mathcal{L}$, and its DL axioms are translated to natural language. A DeBERTa-based model trained on DELTA$_D$ ($\Delta_M$) achieves near-perfect accuracy across depths and complexities. GPT-3.5 and GPT-4 show strong few-shot performance but are less robust at higher depths. Zero-shot tests reveal varying degrees of generalization, and symbolic variants of the dataset reveal limits of semantics-driven learning. The work provides open-source data, points to practical potential for scalable reasoning in real-world diagnostics and beyond, and outlines future work on expanding expressivity and broader model comparisons.

Abstract

Recent advancements in transformer-based language models have sparked research into their logical reasoning capabilities. Most of the benchmarks used to evaluate these models are simple: generated from short (fragments of) first-order logic sentences with only a few logical operators and quantifiers. We construct the natural language dataset, DELTA$_D$, using the expressive description logic language $\mathcal{ALCQ}$. DELTA$_D$ comprises 384K examples and increases in two dimensions: i) reasoning depth, and ii) linguistic complexity. In this way, we systematically investigate the logical reasoning capabilities of a supervised fine-tuned DeBERTa-based model and two large language models (GPT-3.5, GPT-4) with few-shot prompting. We show that the DeBERTa-based model fine-tuned on our dataset can master the entailment checking task. Moreover, the performance of GPTs can improve significantly even when a small number of samples is provided (9 shots). We open-source our code and datasets.
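To make the grammar-based construction more concrete, the following is a minimal sketch of how a probabilistic grammar could sample $\mathcal{ALCQ}$ concept expressions with bounded constructor nesting and render them in English. It is an illustration under stated assumptions, not the authors' generator: the vocabulary, the rule probabilities, the English templates, and the helper names (`sample_concept`, `verbalize`, `nesting`) are all hypothetical, and the way DELTA$_D$ actually defines its linguistic complexity levels is not reproduced here.

```python
import random

# Hypothetical vocabulary; these names are NOT taken from DELTA_D.
ATOMIC = ["person", "student", "teacher"]
ROLES = ["likes", "admires"]
RULES = ["not", "and", "or", "some", "all", "atleast", "atmost"]


def sample_concept(rng, max_nesting):
    """Sample an ALCQ concept as a nested tuple with bounded nesting.

    The 0.3 stopping probability and the uniform rule choice are
    arbitrary assumptions made for this sketch.
    """
    if max_nesting == 0 or rng.random() < 0.3:
        return ("atom", rng.choice(ATOMIC))
    rule = rng.choice(RULES)
    if rule == "not":
        return ("not", sample_concept(rng, max_nesting - 1))
    if rule in ("and", "or"):
        left = sample_concept(rng, max_nesting - 1)
        right = sample_concept(rng, max_nesting - 1)
        return (rule, left, right)
    role = rng.choice(ROLES)
    filler = sample_concept(rng, max_nesting - 1)
    if rule in ("some", "all"):
        return (rule, role, filler)
    # Qualified number restriction: (>= n role.filler) or (<= n role.filler).
    return (rule, rng.randint(1, 3), role, filler)


def nesting(concept):
    """Return the constructor nesting depth of a sampled concept."""
    if concept[0] == "atom":
        return 0
    return 1 + max(nesting(part) for part in concept if isinstance(part, tuple))


def verbalize(concept):
    """Render a concept as an English noun phrase using fixed templates."""
    kind = concept[0]
    if kind == "atom":
        return f"a {concept[1]}"
    if kind == "not":
        return f"something that is not {verbalize(concept[1])}"
    if kind == "and":
        return f"something that is both {verbalize(concept[1])} and {verbalize(concept[2])}"
    if kind == "or":
        return f"something that is either {verbalize(concept[1])} or {verbalize(concept[2])}"
    if kind == "some":
        return f"something that {concept[1]} at least one thing that is {verbalize(concept[2])}"
    if kind == "all":
        return f"something that {concept[1]} only what is {verbalize(concept[2])}"
    bound = "at least" if kind == "atleast" else "at most"
    return f"something that {concept[2]} {bound} {concept[1]} things, each being {verbalize(concept[3])}"


if __name__ == "__main__":
    rng = random.Random(0)
    concept = sample_concept(rng, max_nesting=2)
    print(concept)
    print(verbalize(concept))
```

Keeping the sampled expression as a nested tuple separates the logical form from its verbalization, so the same expression could in principle be passed to a reasoner for labelling and to the templates for text generation. Raising `max_nesting` yields longer and more deeply embedded sentences.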
