Picking the Right Specialist: Attentive Neural Process-based Selection of Task-Specialized Models as Tools for Agentic Healthcare Systems

Pramit Saha, Joshua Strong, Mohammad Alsharid, Divyanshu Mishra, J. Alison Noble

TL;DR

This work introduces ToolSelect, which adaptively learns model selection for tools by minimizing a population risk over sampled specialist tool candidates using a consistent surrogate of the task-conditional selection loss, and proposes an Attentive Neural Process-based selector conditioned on the query and per-model behavioral summaries to choose among the specialist models.

Abstract

Task-specialized models form the backbone of agentic healthcare systems, enabling the agents to answer clinical queries across tasks such as disease diagnosis, localization, and report generation. Yet, for a given task, a single "best" model rarely exists. In practice, each task is better served by multiple competing specialist models where different models excel on different data samples. As a result, for any given query, agents must reliably select the right specialist model from a heterogeneous pool of tool candidates. To this end, we introduce ToolSelect, which adaptively learns model selection for tools by minimizing a population risk over sampled specialist tool candidates using a consistent surrogate of the task-conditional selection loss. Concretely, we propose an Attentive Neural Process-based selector conditioned on the query and per-model behavioral summaries to choose among the specialist models. Motivated by the absence of any established testbed, we, for the first time, introduce an agentic Chest X-ray environment equipped with a diverse suite of task-specialized models (17 disease detection, 19 report generation, 6 visual grounding, and 13 VQA) and develop ToolSelectBench, a benchmark of 1448 queries. Our results demonstrate that ToolSelect consistently outperforms 10 SOTA methods across four different task families.

Paper Structure (18 sections, 10 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Why model selection for tools matters in agentic systems. Top (a): A naive setup with a single fixed specialist model per tool fails under domain and label-space mismatch. Bottom (b): Our approach equips tools with multiple specialist candidates and uses ToolSelect to adaptively select the appropriate model for a given query, yielding more accurate clinical answers than (a).
  • Figure 2: Performance comparison of ToolSelect against 18 task-specialized chest X-ray disease-detection models. Individual specialists exhibit low performance and substantial performance variability. The Oracle upper bound indicates that selecting the appropriate specialist per case can yield large gains, whereas Random selection often degrades performance, sometimes below that of individual models. ToolSelect consistently improves over individual specialists and random selection, aiming to close the gap to the Oracle.
  • Figure 3: Architecture of our agentic framework. The system follows a ReAct-style loop, combining short-term memory (LangChain) with a heterogeneous tool model zoo incorporating fundamental Chest X-Ray-based tasks required to process user queries. Our proposed ToolSelect module is integrated to perform query-conditioned tool candidate (i.e., specialist model) selection.
  • Figure 4: Overview of the proposed ToolSelect architecture. Given a multimodal query (image, text, and task), query features are fused and used to attend over task-specific per-tool reference sets that summarize each candidate tool’s empirical behavior. Cross-attention produces a query-conditioned representation for each tool, which is combined with the tool’s prediction and passed to a selector network. Tools that do not support the queried task are masked out before producing the final tool selection probability distribution.
  • Figure 5: Qualitative report-generation comparison on minor (top) and major (bottom) anomaly cases (GT: ground truth). ToolSelect (ours) routes each query to the appropriate specialist, producing clinically aligned reports that emphasize the correct abnormal regions and severity. Baseline routers often hallucinate severity in minor cases or under-report major findings, yielding generic/misleading text; in the bottom case they all default to the globally strong CheXagent-8B and fail, whereas ToolSelect selects the better-suited CheXpert Plus model (despite lower average performance) and succeeds, highlighting the benefit of query-guided selection.
  • ...and 1 more figure
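
The selection flow described in the Figure 4 caption (attend over per-tool reference sets, combine the result with each tool's prediction, score, mask unsupported tools, normalize) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `select_tool`, the tensor shapes, the single-head dot-product attention, and the linear scoring vector `w_sel` (a stand-in for the selector network) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive probability 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def select_tool(query, ref_keys, ref_values, tool_preds, supports_task, w_sel):
    """Hypothetical query-conditioned tool selection (illustrative only).

    query:         (d,)       fused query feature vector
    ref_keys:      (T, R, d)  keys of R reference examples for each of T tools
    ref_values:    (T, R, d)  values summarizing each tool's empirical behavior
    tool_preds:    (T, p)     features of each tool's prediction for this query
    supports_task: (T,) bool  whether each tool supports the queried task
    w_sel:         (d + p,)   linear stand-in for the selector network (assumed)
    """
    if not np.any(supports_task):
        raise ValueError("no candidate tool supports the queried task")
    d = query.shape[0]
    # Cross-attention: the query attends over each tool's reference set.
    scores = ref_keys @ query / np.sqrt(d)                  # (T, R)
    attn = softmax(scores, axis=-1)                         # (T, R)
    context = np.einsum("tr,trd->td", attn, ref_values)     # (T, d)
    # Combine the query-conditioned representation with the tool's prediction.
    features = np.concatenate([context, tool_preds], axis=-1)  # (T, d + p)
    logits = features @ w_sel                               # (T,)
    # Mask out tools that do not support the task before normalizing.
    logits = np.where(supports_task, logits, -np.inf)
    return softmax(logits)

# Example with random placeholder inputs (no real model outputs involved).
rng = np.random.default_rng(0)
T, R, d, p = 4, 5, 8, 3
probs = select_tool(
    query=rng.normal(size=d),
    ref_keys=rng.normal(size=(T, R, d)),
    ref_values=rng.normal(size=(T, R, d)),
    tool_preds=rng.normal(size=(T, p)),
    supports_task=np.array([True, False, True, True]),
    w_sel=rng.normal(size=d + p),
)
chosen = int(np.argmax(probs))
```

In this sketch the masked tool always receives probability zero, so `chosen` can only index a tool that supports the task. A trained selector would replace the random `w_sel` with learned parameters.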