Exploring the Integration of Key-Value Attention Into Pure and Hybrid Transformers for Semantic Segmentation

DeShin Hwa, Tobias Holmes, Klaus Drechsler

TL;DR

The paper addresses the high computational cost of Transformers in medical image segmentation and investigates KV attention as a lighter alternative. It implements KV and QKV variants of pure SETR and hybrid CNN–Transformer encoders, including KV-pos and CvT-based variants, and trains them on the UW Madison GI Tract MRI dataset. KV variants achieve similar or improved segmentation metrics (Jaccard and Weighted Jaccard) with about a 10% reduction in parameters and MACs, though 2D positional encoding complicates MAC counting. The findings suggest KV attention is a viable path to efficient segmentation models suited to local inference, with potential applicability to other ViT-like architectures.

Abstract

While CNNs were long considered state of the art for image processing, the introduction of Transformer architectures has challenged this position. Despite achieving excellent results in image classification and segmentation, Transformers remain computationally expensive and inherently reliant on large training datasets. A newly introduced Transformer derivative named KV Transformer shows promising results in synthetic, NLP, and image classification tasks, while reducing complexity and memory usage. This is especially beneficial for use cases where local inference is required, such as medical screening applications. We endeavoured to further evaluate the merit of KV Transformers on semantic segmentation tasks, specifically in the domain of medical imaging. By directly comparing traditional and KV variants of the same base architectures, we provide further insight into the practical tradeoffs of reduced model complexity. We observe a notable reduction in parameter count and multiply-accumulate operations, while most KV variant models achieve performance similar to their QKV counterparts.
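
To give a rough sense of where the parameter savings come from, the following back-of-the-envelope sketch counts the weights in one Transformer encoder layer with and without a query projection. It is not the authors' implementation and does not reproduce the paper's numbers. It assumes that KV attention simply drops the query projection and reuses the keys in its place, and that the layer otherwise follows a standard ViT-style layout (output projection, two-layer MLP with expansion ratio 4, two LayerNorms, biases everywhere). The function names and the embedding width of 768 are hypothetical choices made for illustration.

```python
def attention_params(d: int, use_query: bool = True) -> int:
    """Weights + biases of the attention projections for embedding width d.

    Assumption: QKV attention has four d x d projections (Q, K, V, output);
    the KV variant omits the query projection, leaving three.
    """
    n_proj = 4 if use_query else 3
    return n_proj * (d * d + d)


def encoder_layer_params(d: int, mlp_ratio: int = 4, use_query: bool = True) -> int:
    """Parameter count of one ViT-style encoder layer (assumed layout)."""
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)  # two linear layers with biases
    norms = 2 * (2 * d)                             # two LayerNorms: scale + shift
    return attention_params(d, use_query) + mlp + norms


if __name__ == "__main__":
    d = 768  # hypothetical embedding width, not taken from the paper
    qkv = encoder_layer_params(d, use_query=True)
    kv = encoder_layer_params(d, use_query=False)
    print(f"QKV layer parameters: {qkv}")
    print(f"KV layer parameters:  {kv}")
    print(f"Relative reduction:   {(qkv - kv) / qkv:.1%}")
```

Under these assumptions the per-layer saving comes out at roughly 8%, the same order of magnitude as the reduction of about 10% reported above. Whole-model figures also depend on the patch embedding, the decoder, and any convolutional stages, none of which this sketch covers.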

Paper Structure

This paper contains 9 sections, 2 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: We implemented an SETR encoder as well as a variant with KV multi-head attention.
  • Figure 2: We adapted the SETR-PUP decoder to also reduce feature dimensions during upsampling.