SAM2CLIP2SAM: Vision Language Model for Segmentation of 3D CT Scans for Covid-19 Detection

Dimitrios Kollias, Anastasios Arsenos, James Wingate, Stefanos Kollias

TL;DR

The paper addresses accurate COVID-19 detection from 3D chest CT scans by focusing on effective segmentation of the lungs. It introduces SAM2CLIP2SAM, a segmentation framework that leverages the Segment Anything Model (SAM) for part-based masks and the Contrastive Language-Image Pre-Training (CLIP) with GPT-generated prompts to identify the right and left lungs, followed by final segmentation using bounding box prompts. The segmented outputs are then fed into RACNet, a CNN-RNN classifier with a routing mechanism to handle variable slice counts, enabling robust COVID-19 vs non-COVID classification. Empirical results on COV19-CT-DB ECCV 2022 and MosMedData demonstrate substantial improvements in F1 scores over unsegmented and conventionally segmented pipelines, highlighting the practical impact of integrating vision-language segmentation with medical image analysis for reliable, cross-institution COVID-19 detection.

Abstract

This paper presents a new approach for effective segmentation of images that can be integrated into any model and methodology; the paradigm that we choose is classification of medical images (3D chest CT scans) for COVID-19 detection. Our approach combines vision-language models that segment the CT scans, which are then fed to a deep neural architecture, named RACNet, for COVID-19 detection. In particular, a novel segmentation framework, named SAM2CLIP2SAM, is introduced that leverages the strengths of both the Segment Anything Model (SAM) and Contrastive Language-Image Pre-Training (CLIP) to accurately segment the right and left lungs in CT scans; the segmented outputs are subsequently fed into RACNet for classification of COVID-19 and non-COVID-19 cases. First, SAM produces multiple part-based segmentation masks for each slice in the CT scan; then CLIP selects only the masks that are associated with the regions of interest (ROIs), i.e., the right and left lungs; finally, SAM is given these ROIs as prompts and generates the final segmentation mask for the lungs. Experiments on two COVID-19 annotated databases illustrate the improved performance obtained when our method is used to segment the CT scans.
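
The middle step of the three-stage flow described above (candidate masks, then language-guided selection, then box prompts) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `mask_to_box` and `select_lung_boxes`, the "best score per label, kept above a threshold" selection rule, and the threshold value are all assumptions made for illustration. The candidate masks stand in for SAM outputs and the score matrix stands in for CLIP image-text similarities; neither model is called here.

```python
import numpy as np


def mask_to_box(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) of a boolean mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))


def select_lung_boxes(masks, scores, labels=("right lung", "left lung"), threshold=0.5):
    """Pick one candidate mask per label and convert it to a box prompt.

    masks  : list of HxW boolean arrays (stand-ins for part-based candidate masks)
    scores : array of shape (num_masks, num_labels) with a similarity score
             for every (mask, label) pair (stand-in for CLIP similarities)
    Assumed rule: for each label keep the highest-scoring mask, provided its
    score reaches `threshold`; otherwise the label gets no box for this slice.
    """
    scores = np.asarray(scores, dtype=float)
    boxes = {}
    for j, label in enumerate(labels):
        best = int(np.argmax(scores[:, j]))
        if scores[best, j] >= threshold:
            box = mask_to_box(masks[best])
            if box is not None:
                boxes[label] = box
    return boxes


# Toy 8x8 "slice" with three candidate masks: two lung-like blobs and one stray strip.
masks = [np.zeros((8, 8), dtype=bool) for _ in range(3)]
masks[0][2:6, 1:3] = True   # blob on the left side of the image
masks[1][2:6, 5:7] = True   # blob on the right side of the image
masks[2][0:1, :] = True     # thin strip along the top edge
scores = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]  # hypothetical similarity scores

boxes = select_lung_boxes(masks, scores)
print(boxes)
```

In a full pipeline, the resulting boxes would be passed back to the segmentation model as prompts for the final lung masks; slices where no candidate reaches the threshold simply yield no prompt under this assumed rule.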

Paper Structure (14 sections, 10 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Our whole proposed pipeline that includes segmentation and classification tasks
  • Figure 2: The RACNet model for COVID-19 Classification
  • Figure 3: Illustration of the improvement in segmentation quality when the CT scan slices are segmented with our proposed approach, the SAM2CLIP2SAM framework (right column) vs when they are segmented with conventional approaches (middle column).