MINT: A wrapper to make multi-modal and multi-image AI models interactive

Jan Freyberg, Abhijit Guha Roy, Terry Spitz, Beverly Freeman, Mike Schaekermann, Patricia Strachan, Eva Schnider, Renee Wong, Dale R Webster, Alan Karthikesalingam, Yun Liu, Krishnamurthy Dvijotham, Umesh Telang

TL;DR

MINT tackles the costly data collection challenge in multi-modal medical AI by wrapping an existing multi-view classifier to actively select the most informative inputs at inference time. It introduces a value estimator and a threshold to decide whether to acquire additional images or metadata, enabling personalized, stepwise information gathering. On a dermatology dataset with multiple skin images and 25 metadata questions, MINT reduces input requirements by up to 82% for metadata and 36.2% for images while maintaining near-equivalent predictive performance. Simulations based on real-world deployment data estimate that needing fewer inputs lowers user drop-off. The approach is simple, model-agnostic, and tunable to balance diagnostic accuracy against patient burden. Qualitative analysis shows that it aligns with clinical decision-making and behaves differently for easy versus difficult cases.

Abstract

During the diagnostic process, doctors incorporate multimodal information, including imaging and the medical history, and medical AI development has similarly become increasingly multimodal. In this paper we tackle a more subtle challenge: doctors take a targeted medical history to obtain only the most pertinent pieces of information; how do we enable AI to do the same? We develop a wrapper method named MINT (Make your model INTeractive) that automatically determines which pieces of information are most valuable at each step and asks for only the most useful information. We demonstrate the efficacy of MINT by wrapping a skin disease prediction model, where multiple images and a set of optional answers to $25$ standard metadata questions (i.e., structured medical history) are used by a multi-modal deep network to provide a differential diagnosis. We show that MINT can identify whether metadata inputs are needed and, if so, which question to ask next. We also demonstrate that when collecting multiple images, MINT can identify whether an additional image would be beneficial and, if so, which type of image to capture. We show that MINT reduces the number of metadata and image inputs needed by 82% and 36.2% respectively, while maintaining predictive performance. Using real-world AI dermatology system data, we show that needing fewer inputs can retain users who may otherwise fail to complete the system submission and drop off without a diagnosis. Qualitative examples show that MINT can closely mimic the step-by-step decision-making process of a clinical workflow, and how this differs for straightforward cases versus more difficult, ambiguous cases. Finally, we demonstrate that MINT is robust to different underlying multi-modal classifiers and can be easily adapted to user requirements without significant model re-training.
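The stepwise gathering described in the abstract can be pictured as a greedy loop with early stopping. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `run_interactive`, `estimate_value`, `ask_user` and `predict` are hypothetical, the value estimator is treated as an opaque scoring function, and the rule "stop when the best score falls below the threshold" is one plausible reading of how a value estimator and a threshold could interact.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Inputs collected so far, keyed by question or image type (hypothetical layout).
Inputs = Dict[str, str]


def run_interactive(
    candidates: List[str],
    estimate_value: Callable[[Inputs, str], float],
    ask_user: Callable[[str], str],
    predict: Callable[[Inputs], List[str]],
    threshold: float,
    initial: Optional[Inputs] = None,
) -> Tuple[List[str], List[str]]:
    """Greedy acquisition loop with threshold-based early stopping.

    Returns the final prediction and the ordered list of inputs requested.
    """
    collected: Inputs = dict(initial or {})
    asked: List[str] = []
    remaining = [c for c in candidates if c not in collected]

    while remaining:
        # Score every input that has not been collected yet.
        scores = {c: estimate_value(collected, c) for c in remaining}
        best = max(remaining, key=lambda c: scores[c])

        # Assumed stopping rule: no remaining input is valuable enough.
        if scores[best] < threshold:
            break

        collected[best] = ask_user(best)
        asked.append(best)
        remaining.remove(best)

    # The wrapped diagnostic model is only queried through `predict`.
    return predict(collected), asked


if __name__ == "__main__":
    # Toy stand-ins: fixed values per question and canned user answers.
    toy_values = {"itching": 0.6, "duration": 0.3, "fever": 0.05}
    answers = {"itching": "yes", "duration": "weeks", "fever": "no"}
    _, asked = run_interactive(
        candidates=list(toy_values),
        estimate_value=lambda collected, c: toy_values[c],
        ask_user=lambda q: answers[q],
        predict=lambda collected: sorted(collected),
        threshold=0.1,
    )
    print(asked)  # ['itching', 'duration']
```

Raising `threshold` makes the loop stop earlier and ask for fewer inputs; lowering it trades more user effort for more information, which is the accuracy-versus-burden knob the TL;DR refers to.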

Paper Structure (32 sections, 10 equations, 7 figures, 3 tables, 1 algorithm)

Figures (7)

  • Figure 1: a) The MINT framework wraps a multi-modal diagnostic model for interactive use, taking a partial set of images and metadata as inputs and returning the next image type, question or early stopping token. b) The workflow consists of interactive data gathering; in each round all data collected so far is passed to MINT. If MINT requests additional data, the next image or answer is collected from the user, otherwise the workflow stops and returns a final result from the diagnostic model.
  • Figure 2: Histogram of images used by MINT.
  • Figure 3: MINT for structured metadata. Left: Top-3 accuracy vs. number of interactions for different divergence metrics and baselines. Middle: Performance and average number of requested pieces of information for two variants of early stopping, using the two best-performing divergence metrics. We include the lines from the left figure for illustration purposes. Right: Histogram of number of interactions using JS divergence + ES1.
  • Figure 4: The relationship between the number of interactions requested by MINT, and the difficulty of the case as defined by disagreement between dermatologists. Shown in black is a linear regression (Spearman's $\rho=0.11$, $p=7.4\times10^{-54}$).
  • Figure 5: Simulated user outcomes. (a) The estimated proportion of users who do not complete the submission flow, with and without MINT. With MINT, we estimate the drop-off rate to be significantly lower. (b) The estimated proportion of users who see a correct result, i.e. who complete the submission flow and for whom the diagnosis is in the top 3 predictions. With MINT, significantly more users see correct predictions. (c) Cost analysis using MINT. Contour lines and density plots were derived using kernel density estimates.
  • ...and 2 more figures
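
The Figure 3 caption names JS (Jensen-Shannon) divergence as one of the divergence metrics compared. As a reminder of what that metric computes, here is a small self-contained sketch. The function names are illustrative, and which pair of distributions MINT actually compares is not stated in this excerpt; comparing the model's predicted class probabilities before and after adding an input is an assumption made here purely for the example.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Assumed setting: class probabilities before and after one extra input.
    before = [0.5, 0.3, 0.2]
    after = [0.8, 0.1, 0.1]
    print(js_divergence(before, after))
```

Unlike plain KL divergence, the JS form stays finite when one distribution assigns zero probability to a class, which makes it convenient for comparing sharp predictions.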