Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning

Zheang Huai, Honglong Yang, Xiaomeng Li

TL;DR

This paper focuses on chest X-ray analysis and presents a tool-expertise-aware chest X-ray agent (TEA-CXA), a framework that enables an agent to interact with tools and empirically learn their practical trustworthiness across different types of multimodal queries via agentic learning.

Abstract

AI agents with tool-use capabilities show promise for integrating the domain expertise of various tools. In the medical field, however, tools are usually AI models that are inherently error-prone and can produce contradictory responses. Existing research on medical agents lacks sufficient understanding of the tools' realistic reliability and thus cannot effectively resolve tool conflicts. To address this gap, this paper introduces a framework that enables an agent to interact with tools and empirically learn their practical trustworthiness across different types of multimodal queries via agentic learning. As a concrete instantiation, we focus on chest X-ray analysis and present a tool-expertise-aware chest X-ray agent (TEA-CXA). When tool outputs disagree, the agent experimentally accepts or rejects multimodal tool results, receives rewards, and learns which tool to trust for each query type. Importantly, TEA-CXA extends existing codebases for reinforcement learning with multi-turn tool-calling that focus on textual inputs, to support multimodal contexts effectively. In addition, we enhance the codebase for medical use scenarios by supporting multiple tool calls in one turn, parallel tool inference, and multi-image accommodation within a single user query. Our code framework is applicable to general medical research on multi-turn tool-calling reinforcement learning in multimodal settings. Experiments show that TEA-CXA outperforms the state-of-the-art methods and a comprehensive set of baselines. Code will be released.
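
The accept-or-reject loop described in the abstract can be illustrated with a toy stand-in. The sketch below is not the TEA-CXA training procedure, which applies reinforcement learning to a multimodal policy model. It swaps in a much simpler tabular epsilon-greedy bandit to show how rewards alone can reveal which tool to trust for each query type. The tool names, query types, and accuracy numbers are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


class ToolTrustTable:
    """Running mean reward for each (query_type, tool) pair."""

    def __init__(self, tools, epsilon=0.2, seed=0):
        self.tools = list(tools)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.value = defaultdict(float)  # estimated reward per pair
        self.count = defaultdict(int)    # number of updates per pair

    def trusted_tool(self, query_type):
        # Greedy choice: the tool with the highest estimated reward.
        return max(self.tools, key=lambda t: self.value[(query_type, t)])

    def choose(self, query_type):
        # Explore a random tool with probability epsilon, else exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.tools)
        return self.trusted_tool(query_type)

    def update(self, query_type, tool, reward):
        # Incremental mean: v <- v + (r - v) / n
        key = (query_type, tool)
        self.count[key] += 1
        self.value[key] += (reward - self.value[key]) / self.count[key]


def train(table, accuracy, steps=2000, seed=1):
    """Simulate conflicts: trusting a tool pays 1 if that tool was right."""
    env = random.Random(seed)
    query_types = sorted({q for q, _ in accuracy})
    for _ in range(steps):
        q = env.choice(query_types)
        tool = table.choose(q)
        reward = 1.0 if env.random() < accuracy[(q, tool)] else 0.0
        table.update(q, tool, reward)
    return table


# Hypothetical per-query-type accuracies (not taken from the paper).
ACCURACY = {
    ("view_classification", "tool_a"): 0.9,
    ("view_classification", "tool_b"): 0.4,
    ("finding_detection", "tool_a"): 0.4,
    ("finding_detection", "tool_b"): 0.9,
}

table = train(ToolTrustTable(["tool_a", "tool_b"]), ACCURACY)
for query_type in ("view_classification", "finding_detection"):
    print(query_type, "->", table.trusted_tool(query_type))
```

The point of the sketch is that no tool is trusted globally: the learned preference is keyed by query type, so a tool that is weak on one kind of question can still be the preferred source for another.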

Paper Structure

This paper contains 6 sections, 3 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: (a)(b) Previous works use tools in a zero-shot manner or fine-tune the policy model with pre-specified tool-use traces, and thus lack sufficient understanding of the tools' realistic reliability and cannot effectively resolve tool conflicts. (c) Our approach enables the agent to learn tools' empirical trustworthiness across different queries through multimodal agentic learning. (d) Our method outperforms any individual tool and agent-based ensembling of tool outputs.
  • Figure 2: Overview of the proposed tool-expertise-aware chest X-ray agent (TEA-CXA) framework. (a) The agentic learning pipeline with a multimodal policy model and multimodal tools. (b) Details of the generation process for a single rollout. Tool invocations are generated dynamically by the MLLM. When tool results contradict each other, different rollouts try trusting different tools.
  • Figure 3: (a) The system prompt instructs the policy model to follow the specified output format and, when tool results contradict each other, to consider each tool’s output as potentially trustworthy. (b) For queries containing multiple images, our approach directs the policy model to generate image indices instead of file paths for efficient tool invocation. Some details are abbreviated as "(...)" due to space constraints.
  • Figure 4: Qualitative comparison on a sample in CheXbench. Although Lingshu offers more detailed justifications, our method correctly trusts MedGemma’s answer, thanks to its awareness of tools' realistic reliability across multimodal query types.
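
The image-index convention in Figure 3(b), together with the multiple tool calls per turn and parallel tool inference mentioned in the abstract, can be sketched as follows. This is a minimal illustration, not the released code: the call format (`tool`, `image_indices`, `question`), the 0-based indexing, the stub tools, and the file names are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_images(indices, image_paths):
    """Map image indices emitted by the policy model to file paths.

    Assumes 0-based indices; the paper's actual convention is not stated here.
    """
    for i in indices:
        if not 0 <= i < len(image_paths):
            raise IndexError(f"image index {i} is out of range")
    return [image_paths[i] for i in indices]


def run_tool_calls(calls, tools, image_paths, max_workers=4):
    """Run all tool calls from one turn in parallel, preserving call order."""

    def run_one(call):
        paths = resolve_images(call["image_indices"], image_paths)
        return tools[call["tool"]](call["question"], paths)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input calls.
        return list(pool.map(run_one, calls))


# Hypothetical stub tools standing in for real multimodal models.
def tool_a(question, paths):
    return f"tool_a: {len(paths)} image(s)"


def tool_b(question, paths):
    return f"tool_b: {paths[0]}"


images = ["frontal.png", "lateral.png"]  # images attached to one user query
calls = [
    {"tool": "tool_a", "image_indices": [0, 1], "question": "Compare the two views."},
    {"tool": "tool_b", "image_indices": [1], "question": "Describe this view."},
]
results = run_tool_calls(calls, {"tool_a": tool_a, "tool_b": tool_b}, images)
print(results)
```

Emitting short indices rather than full paths keeps the generated tool call compact and avoids asking the model to reproduce long strings exactly; the lookup back to paths happens outside the model.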