SAM2CLIP2SAM: Vision Language Model for Segmentation of 3D CT Scans for Covid-19 Detection

Dimitrios Kollias; Anastasios Arsenos; James Wingate; Stefanos Kollias

TL;DR

The paper addresses accurate COVID-19 detection from 3D chest CT scans by first segmenting the lungs. It introduces SAM2CLIP2SAM, a segmentation framework in which the Segment Anything Model (SAM) produces part-based masks, Contrastive Language-Image Pre-Training (CLIP) with GPT-generated prompts identifies the masks that correspond to the right and left lungs, and SAM then produces the final segmentation from bounding-box prompts. The segmented outputs are fed into RACNet, a CNN-RNN classifier whose routing mechanism handles variable slice counts, for COVID-19 vs. non-COVID-19 classification. Empirical results on COV19-CT-DB ECCV 2022 and MosMedData show substantial improvements in F1 score over unsegmented and conventionally segmented pipelines, which suggests that combining vision-language segmentation with medical image analysis can support reliable, cross-institution COVID-19 detection.

Abstract

This paper presents a new approach for effective segmentation of images that can be integrated into any model and methodology; the paradigm that we choose is classification of medical images (3D chest CT scans) for COVID-19 detection. Our approach combines vision-language models that segment the CT scans, which are then fed to a deep neural architecture, named RACNet, for COVID-19 detection. In particular, a novel segmentation framework, named SAM2CLIP2SAM, is introduced that leverages the strengths of both the Segment Anything Model (SAM) and Contrastive Language-Image Pre-Training (CLIP) to accurately segment the right and left lungs in CT scans; the segmented outputs are subsequently fed into RACNet for classification of COVID-19 and non-COVID-19 cases. First, SAM produces multiple part-based segmentation masks for each slice in the CT scan; then CLIP selects only the masks that are associated with the regions of interest (ROIs), i.e., the right and left lungs; finally, SAM is given these ROIs as prompts and generates the final segmentation mask for the lungs. Experiments on two COVID-19 annotated databases illustrate the improved performance obtained when our method is used to segment the CT scans.
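
The three steps described in the abstract can be illustrated with toy data. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: neither SAM nor CLIP is called, and their outputs are replaced by hand-made arrays. The function names (select_lung_masks, mask_to_box), the similarity threshold, the two prompt strings, and the toy embeddings are all hypothetical. The code shows only the glue logic between the models: choosing the candidate mask that best matches each text prompt by cosine similarity, then converting each chosen mask into a bounding box that could serve as the prompt for the final SAM pass.

```python
import numpy as np

# Placeholder prompts; the paper uses GPT-generated prompts, which are not reproduced here.
PROMPTS = ["right lung", "left lung"]


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def select_lung_masks(mask_embeddings, prompt_embeddings, threshold=0.5):
    """Map each prompt index to the index of its best-matching candidate mask.

    A prompt is left out when even its best match falls below the threshold
    (an assumed cut-off, e.g. for a slice that contains no lung tissue).
    """
    sims = cosine_similarity(mask_embeddings, prompt_embeddings)
    selected = {}
    for p in range(sims.shape[1]):
        best = int(np.argmax(sims[:, p]))
        if sims[best, p] >= threshold:
            selected[p] = best
    return selected


def mask_to_box(mask):
    """Tight (x_min, y_min, x_max, y_max) box around a binary mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))


# Stand-in for step 1: three candidate masks on an 8x8 "slice".
masks = np.zeros((3, 8, 8), dtype=bool)
masks[0, 2:6, 1:3] = True   # candidate covering one lung
masks[1, 2:6, 5:7] = True   # candidate covering the other lung
masks[2, 0:1, :] = True     # unrelated structure (e.g. scanner table)

# Stand-in for step 2: toy 2-D embeddings in place of CLIP image/text features.
mask_emb = np.array([[1.0, 0.1], [0.1, 1.0], [-1.0, -1.0]])
prompt_emb = np.array([[1.0, 0.0], [0.0, 1.0]])

selected = select_lung_masks(mask_emb, prompt_emb)

# Stand-in for step 3: boxes that would be passed to SAM as prompts.
boxes = {PROMPTS[p]: mask_to_box(masks[m]) for p, m in selected.items()}
print(boxes)
```

In a real pipeline the toy arrays would be replaced by SAM's automatically generated masks and by CLIP embeddings of the masked image regions and the text prompts, and each box would be handed back to SAM as a box prompt for that slice.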
