AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images

Yihang Liu, Lianghua He, Ying Wen, Longzhen Yang, Hongzhou Chen

TL;DR

AFiRe addresses the lack of fine-grained anatomical representation in radiographic self-supervised learning (SSL) by combining anatomy-aware token-level contrastive learning with pixel-level anomaly restoration, both guided by synthetic lesion augmentation. By aligning ViT token distributions with spatially-aware anatomical prototypes and selectively restoring abnormal tokens, it learns cohesive fine-grained representations that generalize well under limited labeling. The approach reports superior performance on multi-label classification and anomaly detection across chest X-ray datasets, with Grad-CAM visualizations giving qualitative evidence of localization and ablations supporting each component. The framework is relevant to data-efficient radiographic analysis and to lesion localization that uses only image-level annotations during training.

Abstract

Current self-supervised methods, such as contrastive learning, predominantly focus on global discrimination, neglecting the critical fine-grained anatomical details required for accurate radiographic analysis. To address this challenge, we propose an Anatomy-driven self-supervised framework for enhancing Fine-grained Representation in radiographic image analysis (AFiRe). The core idea of AFiRe is to align anatomical consistency with the unique token-processing characteristics of the Vision Transformer (ViT). Specifically, AFiRe synergistically performs two self-supervised schemes: (i) token-wise anatomy-guided contrastive learning, which aligns image tokens based on structural and categorical consistency, thereby enhancing fine-grained spatial-anatomical discrimination; (ii) pixel-level anomaly-removal restoration, which focuses on local anomalies, thereby refining the learned discrimination with detailed geometrical information. Additionally, we propose a Synthetic Lesion Mask to enhance anatomical diversity while preserving intra-consistency, which is typically corrupted by traditional data augmentations such as cropping and affine transformations. Experimental results show that AFiRe: (i) provides robust anatomical discrimination, achieving more cohesive feature clusters than state-of-the-art contrastive learning methods; (ii) demonstrates superior generalization, surpassing 7 radiography-specific self-supervised methods in multi-label classification tasks with limited labeling; and (iii) integrates fine-grained information, enabling precise anomaly detection using only image-level annotations.
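
As a rough illustration of restoration that "focuses on local anomalies", the NumPy sketch below scores a restored image only on the patches (tokens) touched by a lesion mask. It is not the authors' implementation: the rectangular mask, the patch size, the mean-squared-error form, and the helper names `abnormal_token_flags`, `patchify`, and `restoration_loss` are all assumptions made for this example, since the abstract does not specify how the Synthetic Lesion Mask is generated or how the restoration loss is defined.

```python
import numpy as np

def abnormal_token_flags(lesion_mask, patch):
    """Return one boolean per patch: True if the patch overlaps the mask.

    lesion_mask: (H, W) boolean array, H and W divisible by `patch`.
    Patches are ordered row-major, as ViT tokens usually are.
    """
    h, w = lesion_mask.shape
    grid = lesion_mask.reshape(h // patch, patch, w // patch, patch)
    return grid.any(axis=(1, 3)).reshape(-1)

def patchify(image, patch):
    """Split an (H, W) image into (num_patches, patch * patch) rows."""
    h, w = image.shape
    x = image.reshape(h // patch, patch, w // patch, patch)
    return x.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def restoration_loss(restored, clean, lesion_mask, patch):
    """Mean squared error restricted to abnormal patches (assumed form)."""
    flags = abnormal_token_flags(lesion_mask, patch)
    if not flags.any():
        return 0.0  # nothing was perturbed, so nothing needs restoring
    diff = patchify(restored, patch) - patchify(clean, patch)
    return float((diff[flags] ** 2).mean())

# Toy example: an 8x8 "normal" image and a hypothetical 3x3 lesion.
clean = np.zeros((8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[0:3, 0:3] = True
perturbed = clean + mask.astype(float)  # abnormal input x' = x + lesion

flags = abnormal_token_flags(mask, patch=4)
loss_unrestored = restoration_loss(perturbed, clean, mask, patch=4)
loss_restored = restoration_loss(clean, clean, mask, patch=4)
```

In this toy setup only the top-left of the four patches is flagged as abnormal, so errors elsewhere in the image do not contribute to the loss. A real pipeline would compute the same quantity on decoder outputs rather than on raw arrays.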

Paper Structure

This paper contains 21 sections, 35 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Conception of the proposed method. (a) Local radiographic structures at the same location exhibit anatomical consistency. (b) By aligning anatomical consistency with the token-processing characteristics of ViT, tokens at the same position within a batch share similar structural semantics, while those at different positions convey distinct ones.
  • Figure 2: Overview of the proposed AFiRe. It synergistically performs two self-supervised proxy tasks: token-wise anatomy-guided contrastive learning (Task I) and pixel-level anomaly-removal restoration (Task II). Each normal input $x_i$ is perturbed with the designed Synthetic Lesion Mask ($M_i$) to produce an abnormal input $x^\prime_i$. In Task I, a group of spatial-aware prototypes, updated from the teacher network's output, serves as pseudo-cluster labels to maximize alignment among student-network tokens that belong to the same class or structure. In Task II, the restoration target focuses on the abnormal tokens from augmented pairs of normal radiographic images, which are substituted with mask tokens in the latent space.
  • Figure 3: Updating process of the spatial-aware prototypes. The cluster assignment of $\mathbf{E}^\text{T}$ is used for updating the spatial-aware prototypes.
  • Figure 4: Token-wise anatomy-guided contrastive learning. $\mathcal{L}_{\text{cst}}^\text{stru.}$ and $\mathcal{L}_{\text{cst}}^\text{cate.}$ correspond to the structure-consistency and category-consistency contrastive losses, respectively.
  • Figure 5: t-SNE visualizations of the learned representation. Different colors represent various anomalies (disease classes) in (a) and different image locations in (b).
  • ...and 6 more figures
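
The intuition in Figure 1(b), that tokens at the same position within a batch share structural semantics while tokens at different positions do not, can be sketched as a contrastive loss in which the token position acts as the label. The NumPy code below is a simplified stand-in, not the paper's $\mathcal{L}_{\text{cst}}^\text{stru.}$: it compares tokens directly instead of going through teacher-updated prototypes, and the function name `position_contrastive_loss` and the temperature value are assumptions made for this example.

```python
import numpy as np

def position_contrastive_loss(tokens, temperature=0.1):
    """Supervised-contrastive-style loss with token position as the label.

    tokens: (B, N, D) array of B images, N tokens each, D-dim embeddings.
    Positives for a token are the tokens at the same position in the
    other images of the batch; every other token is a negative.
    """
    b, n, d = tokens.shape
    if b < 2:
        raise ValueError("need at least two images to form positive pairs")

    z = tokens.reshape(b * n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    position = np.tile(np.arange(n), b)  # position index of each row of z

    logits = z @ z.T / temperature
    self_mask = np.eye(b * n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)  # exclude self-pairs

    # Numerically stable log-softmax over all other tokens.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    positives = (position[:, None] == position[None, :]) & ~self_mask
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = per_anchor / positives.sum(axis=1)
    return float(per_anchor.mean())

# Toy batch: 3 images, 3 tokens each, one-hot embeddings per position.
aligned = np.stack([np.eye(3)] * 3)
# Same tokens, but each image has its positions cyclically shifted.
misaligned = np.stack([np.roll(np.eye(3), k, axis=0) for k in range(3)])

loss_aligned = position_contrastive_loss(aligned)
loss_misaligned = position_contrastive_loss(misaligned)
```

The aligned batch, where every image carries the same embedding at the same position, yields a lower loss than the shifted batch, which is the behavior the figure describes. Using prototypes as pseudo-cluster labels, as in the Figure 2 caption, would replace the token-to-token comparison with token-to-prototype assignments.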