What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities

Muchang Bahng, Charlie Berens, Jon Donnelly, Eric Chen, Chaofan Chen, Cynthia Rudin

TL;DR

This work addresses the cost and interpretability challenges of multimodal species classification by introducing two interpretable, cost-aware frameworks: Conformal Abstention Learning (CAL) and Abstention Learning ProtoTree (ALP). CAL ensembles image and genetic logits and uses conformal prediction to bound the influence of the expensive genetic modality, enabling classification with image data alone in many cases while maintaining statistical guarantees. ALP extends ProtoTree so internal nodes can consult either modality, with mechanisms to bias routing toward image-only predictions and a threshold-based initialization to favor cheaper data when accuracy remains high. On BIOSCAN-1M, CAL achieves near-parity with fully multimodal/genetic models while dramatically increasing the “success rate” of image-only predictions, and ALP offers substantial gains in data-efficiency with interpretable routing, highlighting practical pathways to reduce invasive data collection in ecological monitoring.

Abstract

Species detection is important for monitoring the health of ecosystems and identifying invasive species, and it plays a crucial role in guiding conservation efforts. Multimodal neural networks are increasingly used to automate species identification, but they have two major drawbacks. First, their black-box nature makes their decision-making process uninterpretable. Second, collecting genetic data is often expensive and invasive, frequently requiring researchers to capture or kill the target specimen. We address both problems by extending prototype networks (ProtoPNets), a popular and interpretable alternative to traditional neural networks, to the multimodal, cost-aware setting. We ensemble prototypes from each modality, using an associated weight to determine how much a given prediction relies on each modality. We further introduce methods to identify cases in which the expensive genetic information is not needed to make a confident prediction. We demonstrate that our approach can intelligently allocate expensive genetic data to fine-grained distinctions while relying on abundant image data for visually clear cases, achieving accuracy comparable to models that always use both modalities.
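The weighted ensembling and abstention idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the single scalar weight `w`, and the choice of one shared conformal radius applied to every predicted genetic logit are all assumptions made for illustration. The sketch calibrates a split-conformal radius from held-out residuals, then checks whether the ensembled prediction would stay the same for every genetic logit vector inside that radius. If it would, the genetic measurement cannot change the answer and can be skipped.

```python
import math

def conformal_radius(residuals, alpha):
    """Split-conformal quantile of calibration residuals (hypothetical helper).

    residuals: nonconformity scores, e.g. the largest absolute error between
    true and predicted genetic logits on each calibration sample (an assumption).
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]

def ensemble(image_logits, genetic_logits, w):
    """Weighted average of per-class logits; w is the weight on the image modality."""
    return [w * zi + (1 - w) * zg for zi, zg in zip(image_logits, genetic_logits)]

def can_abstain(image_logits, predicted_genetic_logits, radius, w):
    """Return (abstain, label).

    abstain is True when the top class of the ensemble is unchanged for every
    genetic logit vector within +/- radius of the predicted one.
    """
    center = ensemble(image_logits, predicted_genetic_logits, w)
    label = max(range(len(center)), key=center.__getitem__)
    # Each genetic logit can move the ensembled logit by at most (1 - w) * radius.
    slack = (1 - w) * radius if w < 1 else 0.0
    worst_top = center[label] - slack
    best_rival = max(c + slack for j, c in enumerate(center) if j != label)
    return worst_top > best_rival, label

# Usage with made-up numbers
radius = conformal_radius([0.3, 0.1, 0.2], alpha=0.5)
abstain, label = can_abstain([4.0, 1.0, 0.0], [2.0, 1.0, 0.0], radius, w=0.5)
print(abstain, label)
```

A smaller `alpha` gives a larger radius and therefore fewer abstentions, which matches the accuracy versus success-rate trade-off the paper reports, although the exact scoring rule used there may differ from this sketch.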

Paper Structure

This paper contains 36 sections, 15 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: In practice, some samples are easier to classify than others -- if visual information is sufficient to classify a sample, we need not measure genetic information. Our models intelligently decide when they can abstain from measuring genetic information while maintaining accuracy and follow a transparent reasoning process.
  • Figure 2: Our multimodal extensions applied to a vanilla ProtoPNet (top) and a ProtoTree (bottom). Both architectures use the same pair of convolutional feature extractors on image and genetic data. In the ProtoPNet, we take a weighted average of the logits using CAL. In the ProtoTree, we consider either a genetic or an image prototype at each node in the tree and traverse the tree based on the prototype's similarity to the input. Some paths from the root to the leaf do not require consideration of any genetic prototypes, while others do.
  • Figure 3: The prediction head of a multimodal ProtoPNet employing conformal prediction to produce prediction sets for CAL. The image prediction head includes an additional fully connected layer to predict the genetic logits, and conformal prediction is employed to create prediction sets around them.
  • Figure 4: Example reasoning from a trained ALP. Given an input consisting of image and genetic data, we present two possible paths (indicated by highlighted nodes) in which a multimodal ProtoTree may route it to a leaf. In the left path, the first two nodes consider image prototypes and the last two consider genetic prototypes. If the image is routed down this path, we must measure genetic information. The right path, on the other hand, only considers image prototypes. We do not need to measure genetic information for images routed down this path.
  • Figure 5: Balanced accuracy vs. success rate for ALP and CAL models. Red is for ProtoTree and blue is for ProtoPNet. We see that there is a very small loss in accuracy for a very large boost in success rate at small values of $\alpha$ for CAL; with $\alpha=0.1$, there is a 0.5% decrease in accuracy from the most accurate ProtoPNet, in exchange for a 59.20% success rate. In ALP, we see that there is a larger trade-off, with a 5% decrease in accuracy for a 76.15% success rate, but that ALP is effective in obtaining high success rates.
  • ...and 11 more figures
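
The routing behaviour described in the captions for Figures 2 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's ALP: the `Node` and `Leaf` classes, the `similarity` function, the hard 0.5 routing threshold, and the `measure_genetic` callback are all hypothetical. The point it shows is that genetic data is only requested when the path actually reaches a genetic node, so a path made entirely of image nodes counts as a success without any genetic measurement.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, Union

Vector = Sequence[float]

@dataclass
class Leaf:
    label: str

@dataclass
class Node:
    modality: str  # "image" or "genetic": which prototype this node holds
    prototype: Vector
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]

def similarity(x: Vector, prototype: Vector) -> float:
    """Hypothetical similarity in (0, 1]: exp of negative squared distance."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, prototype)))

def route(root: Union[Node, Leaf], image_features: Vector,
          measure_genetic: Callable[[], Vector]) -> Tuple[str, bool]:
    """Greedily route a sample to a leaf.

    Returns (label, image_only), where image_only is True if no genetic
    node was visited and measure_genetic was therefore never called.
    """
    genetic = None
    node = root
    while isinstance(node, Node):
        if node.modality == "genetic":
            if genetic is None:
                genetic = measure_genetic()  # pay the cost only when needed
            x = genetic
        else:
            x = image_features
        # Go right when the input is close to the prototype, otherwise left.
        node = node.right if similarity(x, node.prototype) > 0.5 else node.left
    return node.label, genetic is None

# Usage: one image node at the root, one genetic node on its left branch
tree = Node("image", [0.0, 0.0],
            left=Node("genetic", [1.0], left=Leaf("C"), right=Leaf("B")),
            right=Leaf("A"))
print(route(tree, [0.0, 0.0], lambda: [1.0]))  # resolved from the image alone
print(route(tree, [2.0, 2.0], lambda: [1.0]))  # falls through to a genetic node
```

The fraction of samples for which `image_only` is True plays the role of the success rate plotted in Figure 5, under the assumption that success means reaching a leaf without measuring genetic data.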