Multi-step Inference over Unstructured Data

Aditya Kalyanpur; Kailash Karthik Saravanakumar; Victor Barres; CJ McFate; Lori Moon; Nati Seifu; Maksim Eremeev; Jose Barrera; Abraham Bautista-Castillo; Eric Brown; David Ferrucci

TL;DR

This work addresses the difficulty of high-stakes, multi-step inference over heterogeneous data by proposing a neuro-symbolic platform that combines fine-tuned LLMs for knowledge extraction with a symbolic reasoning engine. Cora, the Collaborative Research Assistant built on this platform, demonstrates how Cogent-based knowledge representations, an Evidenced Graph Builder, and ASP-driven reasoning enable precise, explainable, and verifiable multi-hop inferences in life sciences and macroeconomics. A preliminary medical-domain evaluation shows Cora delivering more reliable, citation-backed, and contextually rich answers than LLM-only and RAG baselines, including strong performance on multi-hop queries. The approach supports domain research through interactive counterfactual analysis and executable causal maps grounded in diverse unstructured data sources.

Abstract

The advent of Large Language Models (LLMs) and Generative AI has revolutionized natural language applications across various domains. However, high-stakes decision-making tasks in fields such as medicine, law, and finance require a level of precision, comprehensiveness, and logical consistency that pure LLM or Retrieval-Augmented Generation (RAG) approaches often fail to deliver. At Elemental Cognition (EC), we have developed a neuro-symbolic AI platform to tackle these problems. The platform integrates fine-tuned LLMs for knowledge extraction and alignment with a robust symbolic reasoning engine for logical inference, planning, and interactive constraint solving. We describe Cora, a Collaborative Research Assistant built on this platform and designed to perform complex research and discovery tasks in high-stakes domains. This paper discusses the multi-step inference challenges inherent in such domains, critiques the limitations of existing LLM-based methods, and demonstrates how Cora's neuro-symbolic approach addresses these issues. We provide an overview of the system architecture and the key algorithms for knowledge extraction and formal reasoning, and present preliminary evaluation results that highlight Cora's superior performance compared to well-known LLM and RAG baselines.

Paper Structure (21 sections, 10 figures, 4 tables)

Introduction
Multi-Step Inference Use-Cases
Life Science Research: Drug Discovery and Re-purposing
Macro-Economic Analysis: Multivariate Causal Inference
Neuro-Symbolic AI Platform
High-Level Architecture
Knowledge Extraction using Statistical Models
Multi-hop QA and Explanations using Symbolic Reasoning
Cogent: KR Language and Meta-Model
Evidenced Graph Building and Symbolic Reasoning
Initial Evaluation
Medical QA Eval
Discussion
Results on Representative Queries data
Results on Multi-hop Queries data
...and 6 more sections

Figures (10)

Figure 1: Using ChatGPT for Medical Research. There are four main classes of problems: (1) No control over the search process, filtering, or ranking of results; (2) Inability to validate without cross-checking references (here, the paper exists but does not contain evidence justifying the claim); (3) Hallucinated references (this citation is made up); (4) No guarantee of completeness (inability to find needles in the haystack).
Figure 2: Elicit's answer to the question linking IRAK4 and RA
Figure 3: Cora's analysis for the IRAK4-RA question. Cora extracts a detailed model linking RA and IRAK4 inhibitors based on a generalized research template, and produces a structured report with claims, evidence and citations
Figure 4: GPT4's response to the macro-economic question: "If economic growth is falling in an Emerging Market country, and the country is facing high inflation, what is the likely impact on nominal bond yields?"
Figure 5: Cora's response to the question on nominal bond yields. Cora extracts a scenario-relevant causal map "on-the-fly" from the corpus and performs precise causal inference to compute the final result. Blue edges in the graph indicate upward pressure on the target node, while red edges indicate downward pressure. Similarly, a blue or red node color depicts whether the quantity is increasing or decreasing, respectively. The graph is fully interactive: the user can alter edge weights, add or remove nodes and edges, and redo the causal inference on the fly.
...and 5 more figures
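
The Figure 5 caption describes a causal map with signed, weighted edges that the user can edit before re-running inference. The sketch below is one hypothetical way to picture that idea, not Cora's actual algorithm: it assumes an acyclic graph and a simple linear rule in which each node's change is its own external shock plus the weighted sum of its parents' changes. The function names (`propagate`, `color`), the node names, and every weight are illustrative assumptions and are not taken from the paper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def propagate(edges, shocks):
    """Propagate signed changes through an acyclic causal map.

    edges:  {(source, target): weight}; a positive weight pushes the target
            in the same direction as the source, a negative one the opposite.
    shocks: {node: signed external change applied directly to that node}.
    The linear update rule is an assumption made for illustration only.
    """
    # Map each node to its set of parents so parents are evaluated first.
    parents = {}
    for source, target in edges:
        parents.setdefault(target, set()).add(source)
        parents.setdefault(source, set())
    for node in shocks:
        parents.setdefault(node, set())

    values = {}
    for node in TopologicalSorter(parents).static_order():
        incoming = sum(
            values[source] * weight
            for (source, target), weight in edges.items()
            if target == node
        )
        values[node] = shocks.get(node, 0.0) + incoming
    return values


def color(value):
    """Blue for an increasing quantity, red for a decreasing one."""
    if value > 0:
        return "blue"
    if value < 0:
        return "red"
    return "neutral"


# Hypothetical map: two drivers act on a mediator, which acts on the target.
edges = {
    ("driver_a", "mediator"): 0.5,
    ("driver_b", "mediator"): 1.0,
    ("mediator", "target"): 1.0,
    ("driver_b", "target"): 0.5,
}
shocks = {"driver_a": -1.0, "driver_b": 1.0}  # one falling, one rising

baseline = propagate(edges, shocks)

# Simulate a user edit: strengthen one edge, then redo the inference.
edited_edges = dict(edges)
edited_edges[("driver_a", "mediator")] = 3.0
edited = propagate(edited_edges, shocks)

print(color(baseline["target"]), color(edited["target"]))
```

Re-running `propagate` after changing a single weight mirrors the interactive loop in the caption: with these made-up numbers, the edit flips the net pressure on the target from upward to downward.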
