Meissa: Multi-modal Medical Agentic Intelligence

Yixiong Chen, Xinyi Bai, Yue Pan, Zongwei Zhou, Alan Yuille

TL;DR

Meissa is a lightweight 4B-parameter medical MM-LLM that brings agentic capability offline, with 22x lower end-to-end latency than API-based deployment. It proposes unified trajectory modeling, which allows one model to generalize across heterogeneous medical environments.

Abstract

Multi-modal large language models (MM-LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi-agent collaboration, enabling complex decision-making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API-based deployment incurs high cost, high latency, and privacy risks that conflict with on-premise clinical requirements. We present Meissa, a lightweight 4B-parameter medical MM-LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi-step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state-action-observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three-tier stratified supervision: the model's own errors trigger progressive escalation from direct reasoning to tool-augmented and multi-agent interaction, explicitly learning difficulty-aware strategy selection. (3) Prospective-retrospective supervision: pairing exploratory forward traces with hindsight-rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25x fewer parameters than typical frontier models like Gemini-3, Meissa operates fully offline with 22x lower end-to-end latency compared to API-based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.
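
The single state-action-observation formalism in point (1) can be pictured as a small record type. The following is a minimal sketch, not the released data format: the class names, field names, environment labels, and example observation strings are all illustrative assumptions. Only the tool name `chest_xray_expert` comes from the paper's case study.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One state-action-observation triple (field names are illustrative)."""
    state: str                  # summary of what the model has seen so far
    action: str                 # a tool call, a message to another agent, or a final answer
    observation: Optional[str]  # environment feedback; None for a terminal answer


@dataclass
class Trajectory:
    environment: str            # hypothetical label, e.g. "direct" or "tool_calling"
    query: str
    steps: List[Step] = field(default_factory=list)

    def interaction_depth(self) -> int:
        """Count the steps that received feedback from the environment."""
        return sum(1 for s in self.steps if s.observation is not None)


# A direct answer and a tool-augmented answer share the same schema.
direct = Trajectory(
    environment="direct",
    query="Is there a pleural effusion?",
    steps=[Step("image + question", "answer: no", None)],
)

tool_use = Trajectory(
    environment="tool_calling",
    query="Is there mediastinal air?",
    steps=[
        Step("image + question", "call: chest_xray_expert", "expert: air along mediastinum"),
        Step("image + question + expert output", "answer: pneumomediastinum", None),
    ],
)
```

Because both records have the same shape, one model can be trained on either kind, and `interaction_depth()` gives a rough analogue of the depth $T$ used in the figures (0 for `direct`, 1 for `tool_use` here).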

Paper Structure

This paper contains 73 sections, 5 equations, 11 figures, 27 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of Meissa: Trajectory-based agentic behavior distillation. Left: Stratified trajectory supervision uses the model's own errors to progressively escalate interaction depth, teaching strategy selection. Center: Four agent environments serve as diverse trajectory sources. Right: Prospective-retrospective supervision teaches both exploration and optimal execution policies.
  • Figure 2: Four agent environments as trajectory sources. Each environment produces trajectories with distinct state--action--observation patterns: (a) tool calling trajectories with vision tool chains, (b) interleaved image-text trajectories with visual feedback loops, (c) multi-agent trajectories with expert debate and synthesis, (d) clinical simulation trajectories with multi-turn information gathering.
  • Figure 3: Strategy selection analysis. (Left) Tier 1 easy queries are answered directly in 96% of cases, while Tier 3 hard queries trigger agentic interaction 97% of the time, confirming difficulty-aware routing. (Center) Meissa accuracy peaks near 2,000 tokens / query and then drops at 4,000 tokens (capacity limit of the 4B model), whereas frontier models scale monotonically. This motivates depth allocation for lightweight models. (Right) Depth-constrained accuracy ($T_{\max}\in\{0,1,2,3,\infty\}$): accuracy improves consistently with interaction depth but exhibits diminishing returns beyond $T_{\max}{=}3$.
  • Figure 4: Case study. Each panel shows a query with Meissa's reasoning trace. (a) $T{=}0$: bilateral nodular infiltrates with cavitation are directly recognizable; no tool is invoked. (b) $T{=}1$: the model calls chest_xray_expert to confirm mediastinal air before diagnosing pneumomediastinum. (c) $T{=}3$: the report generator misses the opacity, but the expert and phrase-grounding tools correctly identify a dense mass (red bounding box); the model reconciles the conflicting outputs. (d) $T{=}3$, conflict resolution: llava_med_qa hallucinates a pulse oximeter and ventilator (red highlight), while two other tools confirm a clean CXR; the model identifies the hallucinated output. (e) $T{=}4$, progressive diagnosis: BiomedParse fails to capture the target regions, so Meissa actively zooms in on the image to confirm the findings.
  • Figure 5: Per-query latency distributions on ChestAgentBench. (a) Meissa completes the majority of queries in under 3 seconds; the long tail corresponds to queries invoking multiple tools. (b) Gemini-3-flash + MedRAX averages 87.2s per query due to multiple API calls and remote tool execution, resulting in ${\sim}22\times$ higher latency than Meissa. Note the different $x$-axis scales.
  • ...and 6 more figures
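
The error-driven escalation described in the abstract and in Figure 1 (direct reasoning first, then tool-augmented, then multi-agent interaction) can be sketched as a simple tier-assignment loop. This is a hedged illustration only: the function `assign_tier`, the tier names, and the toy solvers are hypothetical stand-ins, and the paper's actual data pipeline may differ.

```python
from typing import Callable, List, Optional, Tuple

# A solver maps a query to a predicted answer.
Solver = Callable[[str], str]


def assign_tier(query: str, gold: str, solvers: List[Tuple[str, Solver]]) -> Optional[str]:
    """Return the name of the first tier whose answer matches the gold label.

    Tiers are tried in order of increasing interaction depth, so a wrong
    answer at one tier escalates the query to the next one.
    Returns None if every tier fails.
    """
    for name, solve in solvers:
        if solve(query) == gold:
            return name
    return None


# Toy stand-ins for the three tiers (hypothetical; real tiers would run a model).
def direct(q: str) -> str:
    return "A"


def with_tools(q: str) -> str:
    return "B"


def multi_agent(q: str) -> str:
    return "C"


tiers = [("direct", direct), ("tool_augmented", with_tools), ("multi_agent", multi_agent)]

# Each gold label is resolved at a different tier; "D" is never resolved.
labels = {gold: assign_tier("toy query", gold, tiers) for gold in ["A", "B", "C", "D"]}
```

Under these assumptions, the tier at which a query is first answered correctly becomes its strategy label, which is one way to read the difficulty-aware routing reported in Figure 3.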