Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis

Tuan-Anh Yang; Bao V. Q. Bui; Chanh-Quang Vo-Van; Truong-Son Hy

Abstract

We propose a deep learning framework for COVID-19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice-level and volumetric information. The 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet-18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source robustness. Predictions from both branches are combined through logit-level ensemble inference. Experiments on the PHAROS-AIF-MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID-19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1-score, outperforming both individual models, while for multi-class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1-score. These results highlight the benefit of combining pretrained slice-based representations with volumetric modeling for robust multi-source medical imaging analysis. Code is available at https://github.com/HySonLab/PHAROS-AIF-MIH
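The logit-level ensemble mentioned above can be illustrated with a minimal sketch. It assumes each branch emits one logit per class for a scan and that the two logit vectors are blended with a fixed weight before the softmax. The equal weighting and the function names (`ensemble_logits`, `predict`) are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_logits(logits_25d, logits_3d, weight_25d=0.5):
    """Blend per-class logits from the 2.5D and 3D branches.

    weight_25d is a hypothetical mixing weight; 0.5 means a plain average.
    """
    if len(logits_25d) != len(logits_3d):
        raise ValueError("both branches must score the same classes")
    w = weight_25d
    return [w * a + (1.0 - w) * b for a, b in zip(logits_25d, logits_3d)]


def predict(logits_25d, logits_3d, weight_25d=0.5):
    """Return (predicted class index, class probabilities) for one scan."""
    fused = ensemble_logits(logits_25d, logits_3d, weight_25d)
    probs = softmax(fused)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

Fusing before the softmax lets a confident branch outweigh an uncertain one, whereas averaging probabilities would compress that difference. Whether the authors weight the branches equally is not stated in the abstract.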

Paper Structure (34 sections, 2 figures, 10 tables)

Introduction
Related Work
Methodology
Data Preprocessing
3D Representation Learning
Architecture
Training Strategy
Stage 1: Domain Generalization Pretraining
Stage 2: Task-Specific Fine-tuning
2.5D Multi-View Representation Learning
Multi-View Slice Extraction
DINOv3 Backbone
Multi-View Feature Fusion
Ensemble Integration
Experiments
...and 19 more sections

Figures (2)

Figure 1: Overview of the proposed method. Axial CT slices are first reconstructed into a normalized $128\times128\times128$ volume through preprocessing. From the reconstructed volume, two complementary representations are learned. The 3D branch processes the full volume using a ResNet-18 architecture trained with Variance Risk Extrapolation (VREx) and supervised contrastive learning to improve cross-domain robustness. In parallel, a 2.5D multi-view branch extracts axial, coronal, and sagittal slices and processes them using a DINOv3 backbone. The predictions from both models are aggregated through an ensemble to obtain the final classification.
Figure 2: From each reconstructed CT volume, we extract slices from three orthogonal anatomical planes: the axial view (the original acquisition plane), the coronal view, and the sagittal view.
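The multi-view extraction described in the captions above can be sketched as follows. The sketch assumes the reconstructed volume is indexed as `volume[z][y][x]` and takes a single slice per plane, defaulting to the central one. The indexing order, the choice of the central slice, and the function name `extract_orthogonal_slices` are assumptions made for illustration; the captions do not state how many slices per plane are used.

```python
def extract_orthogonal_slices(volume, z=None, y=None, x=None):
    """Return (axial, coronal, sagittal) 2D slices from a 3D volume.

    volume is a nested list indexed as volume[z][y][x] (assumed layout).
    Any index left as None defaults to the middle of that axis.
    """
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])

    z = depth // 2 if z is None else z
    y = height // 2 if y is None else y
    x = width // 2 if x is None else x

    # Axial: fix z, keep the (y, x) plane.
    axial = [row[:] for row in volume[z]]
    # Coronal: fix y, keep the (z, x) plane.
    coronal = [volume[k][y][:] for k in range(depth)]
    # Sagittal: fix x, keep the (z, y) plane.
    sagittal = [[volume[k][j][x] for j in range(height)] for k in range(depth)]
    return axial, coronal, sagittal
```

For a cubic $128\times128\times128$ volume all three slices are $128\times128$, so they can be fed to the same 2D backbone without per-view resizing.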
