Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning

Tuan Truong, Melanie Dohmen, Sara Lorio, Matthias Lenga

TL;DR

The proposed method consistently outperforms relevant image-only, metadata-only, and multimodal 2D/3D baselines. The results demonstrate that explicitly modeling metadata sparsity and cross-modal interactions improves robustness for DICOM series classification.

Abstract

Automated identification of DICOM image series is essential for large-scale medical image analysis, quality control, protocol harmonization, and reliable downstream processing. However, DICOM series classification remains challenging due to heterogeneous slice content, variable series length, and entirely missing, incomplete, or inconsistent DICOM metadata. We propose an end-to-end multimodal framework for DICOM series classification that jointly models image content and acquisition metadata while explicitly accounting for all these challenges. (i) Images and metadata are encoded with modality-aware modules and fused using a bi-directional cross-modal attention mechanism. (ii) Metadata is processed by a sparse, missingness-aware encoder based on learnable feature dictionaries and value-conditioned modulation. By design, the approach does not require any form of imputation. (iii) Variability in series length and image data dimensions is handled via a 2.5D visual encoder and attention operating on equidistantly sampled slices. We evaluate the proposed approach on the publicly available Duke Liver MRI dataset and a large multi-institutional in-house cohort, assessing both in-domain performance and out-of-domain generalization. Across all evaluation settings, the proposed method consistently outperforms relevant image-only, metadata-only, and multimodal 2D/3D baselines. The results demonstrate that explicitly modeling metadata sparsity and cross-modal interactions improves robustness for DICOM series classification.
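
Point (iii) of the abstract mentions attention operating on equidistantly sampled slices. The abstract does not give the sampling rule, so the following is only a minimal sketch of one plausible way to pick evenly spaced slice indices from a series of arbitrary length. The function name `sample_slice_indices` and its parameters are hypothetical and not taken from the paper.

```python
def sample_slice_indices(num_slices: int, num_samples: int) -> list[int]:
    """Return num_samples slice indices spread evenly over [0, num_slices - 1].

    Illustrative assumption: the first and last slice are always included
    when num_samples >= 2. The paper's exact rule may differ.
    """
    if num_slices <= 0 or num_samples <= 0:
        raise ValueError("num_slices and num_samples must be positive")
    if num_samples == 1:
        # A single sample: take the middle slice.
        return [num_slices // 2]
    last = num_slices - 1
    gaps = num_samples - 1
    # Integer arithmetic with rounding to the nearest index; avoids float error.
    return [(i * last + gaps // 2) // gaps for i in range(num_samples)]


# A 10-slice series reduced to 4 slices.
print(sample_slice_indices(10, 4))
```

With this rule, a series shorter than the requested number of samples yields repeated indices, which keeps the number of sampled slices fixed regardless of series length.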

Paper Structure (13 sections, 3 equations, 2 figures, 4 tables)

This paper contains 13 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Proposed method: pixel data of $S$ DICOM slices is embedded in the visual feature pathway. DICOM metadata is embedded by the Sparse Metadata Encoder. Bi-directional cross-modal attention contextualizes all image and metadata embeddings. Final integration into a series-level representation is done by learnable pooling.
  • Figure 2: In-domain evaluation: five-fold cross-validation per-class F1 scores (%) on the Duke Liver MRI dataset.