LQ-Adapter: ViT-Adapter with Learnable Queries for Gallbladder Cancer Detection from Ultrasound Image

Chetan Madan, Mayuna Gupta, Soumen Basu, Pankaj Gupta, Chetan Arora

TL;DR

This work tackles gallbladder cancer detection in ultrasound images, where noise and small pathology complicate localization. It introduces LQ-Adapter, a learnable-query extension of ViT-Adapter, to strengthen localization through content-aware queries integrated with a CNN-based spatial prior. LQ-Adapter achieves state-of-the-art mean IoU on the GBCU dataset and demonstrates competitive results with substantially fewer trainable parameters, while also generalizing to polyp detection in Kvasir-Seg, indicating cross-domain applicability. The results suggest that leveraging learnable queries on top of frozen foundation-model backbones can yield robust, data-efficient performance for medical imaging tasks without heavy architectural redesigns.

Abstract

We focus on the problem of Gallbladder Cancer (GBC) detection from Ultrasound (US) images. The problem presents unique challenges to modern Deep Neural Network (DNN) techniques due to low image quality arising from noise, textures, and viewpoint variations. Tackling such challenges necessitates precise localization by the DNN to identify the discerning features for the downstream malignancy prediction. While several techniques have been proposed in recent years for the problem, all of these methods employ complex custom architectures. Inspired by the success of foundation models for natural image tasks, along with the use of adapters to fine-tune such models for custom tasks, we investigate the merit of one such design, ViT-Adapter, for the GBC detection problem. We observe that ViT-Adapter relies predominantly on a primitive CNN-based spatial prior module to inject the localization information via cross-attention, which is inefficient for our problem owing to the small pathology sizes and the variability in their appearance caused by the irregular structure of the malignancy. In response, we propose LQ-Adapter, a modified Adapter design for ViT, which improves localization information by leveraging learnable content queries over the basic spatial prior module. Our method surpasses existing approaches, improving the mean IoU (mIoU) scores by 5.4%, 5.8%, and 2.7% over ViT-Adapter, DINO, and FocalNet-DINO, respectively, on the US image-based GBC detection dataset, establishing a new state-of-the-art (SOTA). Additionally, we validate the applicability and effectiveness of LQ-Adapter on the Kvasir-Seg dataset for polyp detection from colonoscopy images. The superior performance of our design on this problem as well showcases its capability to handle diverse medical imaging tasks across different datasets. Code is released at https://github.com/ChetanMadan/LQ-Adapter
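
To make the mechanism described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation, of single-head cross-attention in which learnable content queries are combined with spatial-prior features before attending to frozen ViT features. The element-wise addition, the toy shapes, the absence of learned projections, and all names (`cross_attention`, `learnable_queries`, `f_sp`, `f_vit`) are illustrative assumptions; the released code is the reference for the actual design.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy inputs: 2 query tokens, 3 ViT tokens, feature width 2.
f_sp = [[1.0, 0.0], [0.0, 1.0]]                # stand-in for spatial-prior features
f_vit = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-in for frozen ViT features
learnable_queries = [[0.0, 0.0], [0.0, 0.0]]   # trainable in practice; zero-initialized here

# Assumption: the learnable queries are added element-wise to the spatial prior.
queries = [[s + q for s, q in zip(s_row, q_row)]
           for s_row, q_row in zip(f_sp, learnable_queries)]
attended = cross_attention(queries, f_vit, f_vit)
```

In a trainable setting, `learnable_queries` would be updated by gradient descent while the ViT features stay frozen, which is what keeps the number of trainable parameters small.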

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: We compare model sizes and performance (mean intersection-over-union) of SOTA transformer-based object detection methods on the GBCU dataset. The comparison highlights the advantage of LQ-Adapter, which achieves competitive performance while maintaining a smaller parameter footprint than existing methods.
  • Figure 2: Schematic architecture diagram of the proposed LQ-Adapter. The learnable content queries are added to the extractor blocks of the adapter modules for improved localization. FFN: Feed Forward Network, LQ: Learnable Queries, $F_{sp}$: features from the Spatial Prior Module, $F_{vit}$: features from the frozen ViT [vit] backbone.
  • Figure 3: Sample images from the GBCU [basu2022surpassing] and Kvasir-Seg [kvasirseg] datasets. Malignant and benign samples from GBCU are on the left and right, respectively. The Kvasir-Seg dataset does not contain control images, so both sides show images with polyp tissue.
  • Figure 4: Ablation study. (a) Effect of the number of learnable query blocks on performance; augmenting all layers with learnable queries yields the highest performance gain. (b) Effect of initializing the queries with zero values versus random values.
  • Figure 5: We motivate the use of learnable content queries in the adapter design. We show sample localizations by ViT-Adapter [vita], DINO [dino], FocalNet-DINO [focalnetfocalstabledino], and LQ-Adapter (ours). Primitive spatial prior modules in ViT-Adapter do not capture the salient region well, reducing detection quality. LQ-Adapter, on the other hand, learns the region information well via the learnable queries and thus demonstrates superior localization performance. Rows (a)-(c) show selected samples from the GBCU dataset [basu2022surpassing], and rows (d)-(f) show samples from the Kvasir-Seg dataset [kvasirseg]. Green bounding boxes: ground truth; red bounding boxes: prediction.
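
The comparison of ground-truth and predicted bounding boxes in Figure 5 is what the mean IoU metric quantifies. As a rough illustration, the sketch below computes IoU for axis-aligned boxes and averages it over paired boxes. The `(x1, y1, x2, y2)` box format, the one-prediction-per-ground-truth pairing, and the function names are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(gt_boxes, pred_boxes):
    """Average IoU over paired boxes (assumes one prediction per ground truth)."""
    ious = [box_iou(g, p) for g, p in zip(gt_boxes, pred_boxes)]
    return sum(ious) / len(ious) if ious else 0.0

# Hypothetical boxes: the first prediction overlaps half of its ground truth,
# the second matches its ground truth exactly.
gt = [(0, 0, 10, 10), (0, 0, 4, 4)]
pred = [(5, 0, 15, 10), (0, 0, 4, 4)]
score = mean_iou(gt, pred)  # (1/3 + 1) / 2 = 2/3
```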