LLMs can construct powerful representations and streamline sample-efficient supervised learning

Ilker Demirel, Larry Shi, Zeshan Hussain, David Sontag

TL;DR

Across 15 clinical tasks from the EHRSHOT benchmark, rubric-based approaches significantly outperform traditional count-feature models, naive text-serialization-based LLM baselines, and a clinical foundation model, which is pretrained on orders of magnitude more data.

Abstract

As real-world datasets become increasingly complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data, such as time-series, free text, and structured records, for downstream tasks often requires non-trivial domain-specific engineering. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric-based approaches significantly outperform traditional count-feature models, naive text-serialization-based LLM baselines, and a clinical foundation model that is pretrained on orders of magnitude more data. Beyond performance, rubrics offer several advantages for operational healthcare settings: they are easy to audit, cost-effective to deploy at scale, and convertible to tabular representations that unlock a swath of machine learning techniques.
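To make the idea of a rubric as a "programmatic specification" concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. In the paper the rubric is synthesized by an LLM; here it is hand-written, and the field names, regular expressions, and the helpers `to_rubric_text` and `to_tabular` are all hypothetical. The sketch shows how a naive text-serialization could be mapped to a standardized rubric text and then to a tabular row.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricField:
    name: str     # hypothetical field label in the standardized output
    pattern: str  # regex with one group capturing a numeric value


# Hypothetical, hand-written rubric; the paper has an LLM synthesize this.
RUBRIC = [
    RubricField("age_years", r"Age:\s*(\d+(?:\.\d+)?)"),
    RubricField("latest_troponin", r"Troponin[^:\n]*:\s*(\d+(?:\.\d+)?)"),
    RubricField("latest_ldl", r"LDL[^:\n]*:\s*(\d+(?:\.\d+)?)"),
]


def to_rubric_text(x_text, rubric):
    """Map a naive text-serialization to one 'name: value' line per field.

    Assumes entries appear in chronological order, so the last match
    is treated as the most recent value. Missing fields become 'NA'.
    """
    lines = []
    for field in rubric:
        matches = re.findall(field.pattern, x_text)
        value = matches[-1] if matches else "NA"
        lines.append(f"{field.name}: {value}")
    return "\n".join(lines)


def to_tabular(x_rubric):
    """Parse the standardized rubric text into a feature dictionary."""
    row = {}
    for line in x_rubric.splitlines():
        name, _, value = line.partition(": ")
        row[name] = None if value == "NA" else float(value)
    return row


# Synthetic example record (not real patient data).
x_text = "Age: 63\nTroponin I: 0.02\nTroponin I: 0.41\nNote: chest pain"
x_rubric = to_rubric_text(x_text, RUBRIC)
features = to_tabular(x_rubric)
```

Because every record passes through the same fixed set of fields, the output is straightforward to audit and can be fed to any tabular learner; a real rubric would cover far more evidence than these three illustrative fields.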

Paper Structure (51 sections, 1 equation, 17 figures, 20 tables)

Figures (17)

  • Figure 1: Performance averaged over all 15 clinical prediction tasks in the EHRSHOT benchmark with 6,739 patients (Wornow et al., 2023). Our rubric-style representations, agentically constructed by LLMs, outperform the naive text-serialization-based LLM baseline of Hegselmann et al. (2025), as well as a clinical foundation model pretrained on 2.57M patients (CLMBR-T; Wornow et al., 2023) and a count feature-based gradient boosting machine (Count-GBM; Ke et al., 2017; Wornow et al., 2023).
  • Figure 2: Synthetic electronic health record (EHR) representation examples, focusing on the acute myocardial infarction (acute MI) prediction task. Left. Naive text-serialization adopted from Hegselmann et al. (2025). Middle. Local rubric representation, which is a task-conditioned summary of the naive text-serialization. Right. Global rubric transformed version of the naive text-serialization.
  • Figure 3: Agentic global-rubric pipeline for EHRSHOT tasks. (A) Build a label-balanced and diverse patient set via $k$-means. (B) Patient EHRs are fed to an LLM which is prompted to synthesize a task rubric. (C) The LLM outputs a systematic rubric $\mathcal{R}$ that defines how to transform any patient EHR from naive text ($x_{\text{text}}$) to textual rubric representation ($x_{\text{rubric}}$). (D) An LLM is asked to transform $x_{\text{text}}$ to $x_{\text{rubric}}$ for each patient. (E) An LLM is asked to write a script to automate the transformation step in Panel (D). (F) An LLM is asked to write a script to transform rubric representations $x_{\text{rubric}}$ into tabular features. Full prompts are provided in Appendix \ref{app:agent_prompts}.
  • Figure 4: Prompts used for generating local rubric representations. Left. Prompt for generating task-conditioned local rubric summaries. Right. Prompt for generating generic local rubric summaries (ablation).
  • Figure 5: Prompt for converting textual inputs to embeddings. An example task query: "Will the patient develop lupus within next year?"
  • ...and 12 more figures
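
Panel (A) of Figure 3 mentions building a label-balanced and diverse patient set via $k$-means. The source does not spell out the procedure, so the following is only one plausible reading, sketched in plain Python: run $k$-means separately within each label class and keep the patient closest to each centroid. The function names, the per-class clustering, and the use of generic numeric vectors as patient representations are all assumptions made for illustration.

```python
import random


def _sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids (lists of floats)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _sqdist(p, centroids[c]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def select_diverse_balanced(vectors, labels, per_label, seed=0):
    """Pick up to `per_label` indices per class, one near each centroid.

    Hypothetical helper: balance comes from the equal per-class quota,
    diversity from taking the nearest point to each k-means centroid.
    """
    chosen = []
    for label in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == label]
        k = min(per_label, len(idx))
        centroids = kmeans([vectors[i] for i in idx], k, seed=seed)
        remaining = set(idx)
        for c in centroids:
            best = min(sorted(remaining), key=lambda i: _sqdist(vectors[i], c))
            chosen.append(best)
            remaining.remove(best)
    return chosen


# Synthetic 2-D vectors standing in for patient representations.
vectors = [[0, 0], [0.1, 0], [5, 5], [5.1, 5],
           [10, 0], [10.1, 0], [0, 10], [0.1, 10]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
subset = select_diverse_balanced(vectors, labels, per_label=2)
```

The selected indices would then identify the patients whose text-serialized records are placed in the LLM's context for rubric synthesis in Panel (B).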