CLARiTy: A Vision Transformer for Multi-Label Classification and Weakly-Supervised Localization of Chest X-ray Pathologies

John M. Statheros, Hairong Wang, Richard Klein

TL;DR

CLARiTy presents a Vision Transformer framework with multiple class tokens and SegmentCAM to jointly perform multi-label chest X-ray pathology classification and weakly-supervised localization using only image-level labels and anatomical priors. It achieves competitive classification accuracy on NIH ChestX-ray14 and state-of-the-art localization, notably improving Macro IoU at higher IoU thresholds and excelling for small lesions; a low-resolution variant demonstrates efficiency suitable for resource-constrained settings. The approach is bolstered by distillation from a ConvNeXtV2 teacher and self-supervised pretraining (DINO), with ablations confirming the value of SegmentCAM, orthogonal class-token loss, and attention pooling. This work emphasizes interpretable heatmaps and bias-aware localization, addressing common shortcut-learning concerns in chest X-ray analysis. Overall, CLARiTy advances heatmap quality and class-specific localization while maintaining robust classification, offering practical benefits for automated CXR screening and potential extensions to other modalities.

Abstract

The interpretation of chest X-rays (CXRs) poses significant challenges, particularly in achieving accurate multi-label pathology classification and spatial localization. These tasks demand different levels of annotation granularity but are frequently constrained by the scarcity of region-level (dense) annotations. We introduce CLARiTy (Class Localizing and Attention Refining Image Transformer), a vision transformer-based model for joint multi-label classification and weakly-supervised localization of thoracic pathologies. CLARiTy employs multiple class-specific tokens to generate discriminative attention maps, and a SegmentCAM module for foreground segmentation and background suppression using explicit anatomical priors. Trained on image-level labels from the NIH ChestX-ray14 dataset, it leverages distillation from a ConvNeXtV2 teacher for efficiency. Evaluated on the official NIH split, the CLARiTy-S-16-512 configuration achieves competitive classification performance across 14 pathologies and state-of-the-art weakly-supervised localization performance on 8 pathologies, outperforming prior methods by 50.7%. Gains are most pronounced for small pathologies such as nodules and masses. The lower-resolution variant, CLARiTy-S-16-224, offers high efficiency while decisively surpassing baselines, giving it potential for use in low-resource settings. An ablation study confirms the contributions of SegmentCAM, DINO pretraining, the orthogonal class token loss, and attention pooling. CLARiTy advances beyond CNN-ViT hybrids by harnessing ViT self-attention for global context and class-specific localization, refined through convolutional background suppression for precise, noise-reduced heatmaps.

Paper Structure

This paper contains 53 sections, 24 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Illustration of the proposed CLARiTy model. An input chest X-ray image is split into $N^2$ patches and embedded into $P$ patch tokens, which are concatenated with $C$ class tokens. Learned positional embeddings are added to produce the $C+P$ input tokens to the transformer. A series of $d$ transformer blocks extracts relevant information for classification and weakly-supervised localization. At the output, the $C$ class tokens are passed through an attention pooling module to produce class token logits. The $P$ output patch tokens are passed through the SegmentCAM module, which produces foreground logits and a segmentation map. The self-attention maps from the final $p$ transformer blocks are fused to produce class-specific attention maps.
  • Figure 2: Transformer self-attention matrix with multiple class tokens. The class tokens are denoted $c_1, c_2, \dots, c_C$, and the patch tokens are denoted $n_1, n_2, \dots, n_P$. The class-to-patch attention in the upper-right sub-matrix is extracted for the class-specific attention maps.
  • Figure 3: Weakly-supervised localization method of CLARiTy. During inference, a foreground mask and an attention map are produced for each class. Element-wise multiplication yields a class-specific localization heatmap that is both highly precise and confined to the class's ground truth region. In this example chest X-ray, a mass (round opacity) is found in the upper-middle left lung of the patient. The attention map has high intensity directly over the mass, and the heatmap intensity is confined to the lung lobes.
  • Figure 4: Illustration of the proposed CLARiTy model. During training, the final $q$ layers of class tokens are regularized using the orthogonal class token loss $\mathcal{L}_{\mathrm{OCT}}$, which promotes orthogonality between class tokens. Following the attention pooling layer, the class token logits are used to calculate the class token classification loss $\mathcal{L}_{\mathrm{CLS}}^{C}$. The outputs of the SegmentCAM module are used to calculate the combined CAM loss $\mathcal{L}_{\mathrm{CAM}}$.
  • Figure 5: Illustration of the attention pooling module. Row-wise softmax activation is applied to a set of learned weights, which are invariant to the input image. This matrix is element-wise multiplied with the output class tokens after reshaping to a $C\times D$ matrix. Thereafter, a row-wise sum produces class-specific logits. Each dimension in $D$ is able to attend to particular class features, which complements the orthogonal class token loss $\mathcal{L}_{\mathrm{OCT}}$.
  • ...and 6 more figures
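
The attention pooling step described in the Figure 5 caption can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `softmax` and `attention_pool` are hypothetical, the class tokens are assumed to be already reshaped to a $C\times D$ nested list, and the learned weights are passed in as fixed numbers.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of weights."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(class_tokens, pool_weights):
    """Pool C output class tokens (each of dimension D) into C logits.

    class_tokens: C x D nested list of output class tokens.
    pool_weights: C x D nested list of learned weights that do not
                  depend on the input image (hypothetical values here).
    """
    if len(class_tokens) != len(pool_weights):
        raise ValueError("class_tokens and pool_weights must both have C rows")
    logits = []
    for token, weights in zip(class_tokens, pool_weights):
        if len(token) != len(weights):
            raise ValueError("each row must have D entries")
        attn = softmax(weights)  # row-wise softmax over the D dimensions
        # Element-wise product followed by a row-wise sum gives one logit.
        logits.append(sum(a * t for a, t in zip(attn, token)))
    return logits


if __name__ == "__main__":
    tokens = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
    uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(attention_pool(tokens, uniform))
```

With all-zero weights the softmax is uniform, so each logit reduces to the mean of its class token; non-uniform weights let each class emphasize particular dimensions of $D$, which is the behavior the caption describes.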
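
The localization procedure in the Figure 2 and Figure 3 captions (extract the class-to-patch sub-matrix, fuse across blocks, multiply by a foreground mask) can be sketched as follows. This is a hedged illustration under stated assumptions: the function names are hypothetical, class tokens are assumed to occupy the first $C$ rows and columns of each attention matrix, patches are kept as a flat list of length $P$, and fusion is done by simple averaging, which the captions do not specify.

```python
def class_to_patch_attention(attn, num_classes):
    """Return the upper-right C x P sub-matrix of a (C+P) x (C+P) attention matrix.

    Assumes class tokens come first, followed by patch tokens.
    """
    return [row[num_classes:] for row in attn[:num_classes]]


def fuse_attention(attn_per_block, num_classes):
    """Fuse class-to-patch attention over several blocks.

    Averaging is an assumption made for this sketch; the source only
    says the maps from the final p blocks are fused.
    """
    maps = [class_to_patch_attention(a, num_classes) for a in attn_per_block]
    num_blocks = len(maps)
    num_patches = len(maps[0][0])
    return [
        [sum(m[c][p] for m in maps) / num_blocks for p in range(num_patches)]
        for c in range(num_classes)
    ]


def localization_heatmaps(fused_attention, foreground_masks):
    """Element-wise product of each class attention map with its foreground mask."""
    return [
        [a * m for a, m in zip(attn_row, mask_row)]
        for attn_row, mask_row in zip(fused_attention, foreground_masks)
    ]


if __name__ == "__main__":
    # Toy example: C = 1 class token, P = 2 patch tokens, 2 blocks.
    block_a = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.1, 0.3, 0.6]]
    block_b = [[0.0, 0.5, 0.5], [0.2, 0.4, 0.4], [0.2, 0.4, 0.4]]
    fused = fuse_attention([block_a, block_b], num_classes=1)
    masks = [[1.0, 0.0]]  # hypothetical foreground mask: only patch 0 is foreground
    print(localization_heatmaps(fused, masks))
```

In a real pipeline each length-$P$ row would be reshaped to the patch grid and upsampled to image resolution before thresholding; those steps are omitted here.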