CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance

Haochen Liu, Weien Li, Rui Song, Zeyu Li, Chun Jason Xue, Xiao-Yang Liu, Sam Nallaperuma, Xue Liu, Ye Yuan

Abstract

Large language model (LLM) systems are increasingly used to support high-stakes decision-making, but they typically perform worse when the available evidence is internally inconsistent. Such scenarios arise in real-world healthcare settings, where patient-reported symptoms contradict objective clinical signs. To study this problem, we introduce MIMIC-DOS, a dataset for short-horizon organ dysfunction worsening prediction in the intensive care unit (ICU) setting. We derive this dataset from the widely used MIMIC-IV, a publicly available electronic health record dataset, and construct it exclusively from cases that exhibit discordance between signs and symptoms. This setting poses a substantial challenge for existing LLM-based approaches: single-pass LLMs and agentic pipelines often struggle to reconcile such conflicting signals. To address this problem, we propose CARE, a multi-stage privacy-compliant agentic reasoning framework in which a remote LLM provides guidance by generating structured categories and transitions without accessing sensitive patient data, while a local LLM uses these categories and transitions to support evidence acquisition and final decision-making. Empirically, CARE achieves stronger performance across all key metrics compared to multiple baseline settings, showing that it can handle conflicting clinical evidence more robustly while preserving privacy.

Paper Structure

This paper contains 36 sections, 4 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Comparison of three decision-making settings: (1) single-pass inference with a proprietary LLM (left, light blue), (2) single-pass inference with a local LLM (middle, light yellow), and (3) our proposed CARE framework (right, light red). In the single-pass settings, the model receives raw patient values and feature columns directly as input. Using a proprietary LLM in this way risks privacy leakage, while relying only on a self-hosted local LLM can lead to poorer decisions. In contrast, CARE enables the proprietary LLM to provide structured guidance to the local LLM without accessing raw patient values, allowing privacy-compliant decision making while preserving strong performance.
  • Figure 2: Overview of the CARE framework. In Stage 1, a proprietary LLM constructs a rubric schema over intermediate patient states from task information, and the local side applies this rubric to patient data to obtain an initial state assignment. In Stage 2, the local LLM performs evidence checks by determining whether additional features are needed given the current state and observed values. In Stage 3, the proprietary LLM generates transition guidance from an abstract view of the current state and available feature types; the local side updates the state through recomputation and constrained merge without exposing raw patient data. In Stage 4, the local LLM produces the final task decision from the patient data and the accumulated reasoning trace over states. In our framework, raw patient values remain local throughout the entire pipeline, allowing the framework to preserve privacy.
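The four-stage loop described in the Figure 2 caption can be sketched as follows. This is a toy illustration only: the LLM calls are replaced by hypothetical placeholder functions (`remote_build_rubric`, `local_assign_state`, etc.), and the rubric rules, feature names, and thresholds are invented for the example; the paper does not specify this API. The sketch's point is the data-flow invariant: the two `remote_*` functions receive only task metadata and abstract state/type information, never raw patient values.

```python
# Hypothetical sketch of CARE's four stages; all names and rules below are
# illustrative assumptions, not the paper's implementation.

def remote_build_rubric(task_info):
    # Stage 1 (remote): build a rubric over abstract patient states from
    # task metadata only -- no raw patient values are sent.
    return {"high_risk": lambda p: p["lactate"] > 2.0,
            "low_risk":  lambda p: p["lactate"] <= 2.0}

def local_assign_state(rubric, patient):
    # Stage 1 (local): apply the rubric to private patient data.
    return next(s for s, rule in rubric.items() if rule(patient))

def local_evidence_check(patient, wanted_features):
    # Stage 2 (local): decide which additional features are still needed
    # given the currently observed values.
    return [f for f in wanted_features if f not in patient]

def remote_transition_guidance(state, feature_types):
    # Stage 3 (remote): guidance from the state name and feature *types*
    # only; raw patient values never reach this call.
    return {"next_state": "high_risk" if "symptom" in feature_types else state}

def local_decide(patient, trace):
    # Stage 4 (local): final decision from the data plus the state trace.
    return trace[-1] == "high_risk"

def care_pipeline(task_info, patient, wanted_features, feature_types):
    rubric = remote_build_rubric(task_info)
    trace = [local_assign_state(rubric, patient)]
    missing = local_evidence_check(patient, wanted_features)
    guidance = remote_transition_guidance(trace[-1], feature_types)
    trace.append(guidance["next_state"])  # constrained merge (toy version)
    return local_decide(patient, trace), missing
```

In this sketch the privacy boundary is enforced purely by the argument lists: anything passed to a `remote_*` function is, by construction, abstract. The paper's "recomputation and constrained merge" step is reduced here to appending the guided state to the trace.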
  • Figure 3: Held-out UMAP geometry of MIMIC-DOS. The x- and y-axes are the two coordinates of a UMAP space fit on a separate training cohort and used to project MIMIC-DOS for visualization. They do not correspond to individual clinical variables. Left: local positive prevalence estimated from the ground-truth labels, where GT 0 denotes the negative class and GT 1 denotes the positive class. Right: local neighborhood purity. Persistent label mixing across both panels highlights the strong overlap structure of the benchmark.
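The "local neighborhood purity" shown in the right panel of Figure 3 can be computed with a plain k-nearest-neighbor pass over the 2-D embedding coordinates. The sketch below is an assumption about that computation (the paper does not state its exact k or distance metric), and it takes the projected coordinates as given rather than reproducing the UMAP fit on the separate training cohort.

```python
# Assumed neighborhood-purity diagnostic over 2-D embedding coordinates;
# k and the Euclidean metric are illustrative choices, not from the paper.
import numpy as np

def neighborhood_purity(coords, labels, k=5):
    """For each point, the fraction of its k nearest neighbors (excluding
    itself) that share its ground-truth label. Values near 0.5 on a binary
    task indicate strong label mixing, as in Figure 3."""
    # Pairwise Euclidean distances; self-distances set to inf so a point
    # is never its own neighbor.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]                 # k nearest indices
    return (labels[nn] == labels[:, None]).mean(axis=1)
```

On two well-separated single-label clusters this returns 1.0 everywhere; the persistent sub-1.0 purity reported for MIMIC-DOS is what the caption describes as the benchmark's strong overlap structure.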