DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions

Gorkem Can Ates, Yu Xin, Kuang Gong, Wei Shao

TL;DR

DCFormer is an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions, preserving spatial information while significantly reducing computational cost.

Abstract

Vision-language models (VLMs) have been widely applied to 2D medical image analysis due to their ability to align visual and textual representations. However, extending VLMs to 3D imaging remains computationally challenging. Existing 3D VLMs often rely on Vision Transformers (ViTs), which are computationally expensive due to the quadratic complexity of self-attention, or on 3D convolutions, which require large numbers of parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is trained and evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports. In zero-shot and fine-tuned detection of 18 pathologies, as well as in image-text retrieval tasks, DCFormer consistently outperforms state-of-the-art 3D vision encoders, including CT-ViT, ViT, ConvNeXt, PoolFormer, and TransUNet. These results highlight DCFormer's potential for scalable, clinically deployable 3D medical VLMs. Our code is available at: https://github.com/mirthAI/DCFormer.
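
To make the cost argument concrete, the following is a minimal sketch of how weight counts and multiply-accumulates (MACs) scale with kernel size for a standard 3D convolution, a depthwise 3D convolution, and three parallel depthwise 1D convolutions. It is an illustration, not the authors' code: the function names are hypothetical, bias terms are ignored, and stride 1 with "same" padding is assumed so that every weight is applied once per output voxel.

```python
def standard_conv3d_params(c_in, c_out, k):
    """Weights in a standard 3D convolution with a k x k x k kernel (no bias)."""
    return c_in * c_out * k ** 3


def depthwise_conv3d_params(channels, k):
    """Weights in a depthwise 3D convolution: one k x k x k kernel per channel."""
    return channels * k ** 3


def decomposed_depthwise_params(channels, k):
    """Weights in three parallel depthwise 1D convolutions (depth, height, width).

    Each channel gets one length-k kernel per axis, so 3 * k weights in total.
    """
    return 3 * channels * k


def macs(params, depth, height, width):
    """Multiply-accumulates, assuming stride 1 and 'same' padding.

    Under that assumption every weight is used once per output voxel.
    """
    return params * depth * height * width


if __name__ == "__main__":
    channels, size = 32, 32  # illustrative values, chosen here for the example
    for k in (3, 5, 7, 9, 11):
        full = standard_conv3d_params(channels, channels, k)
        dw = depthwise_conv3d_params(channels, k)
        dec = decomposed_depthwise_params(channels, k)
        print(k, full, dw, dec, macs(dec, size, size, size))
```

Under these assumptions the depthwise kernel grows cubically in k, while the decomposed variant grows linearly, so the ratio between them is k^2 / 3.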

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Parameter count and computational cost (FLOPs) comparison for 2D and 3D standard and depthwise convolutions across kernel sizes. For simplicity, the number of output channels is fixed at $C=32$, and the depth dimension is set to $D=32$.
  • Figure 2: Parameter count and computational cost (FLOPs) comparison for standard 3D depthwise convolution and decomposed depthwise convolution.
  • Figure 3: Block illustration of MetaNeXt, ConvNeXt and DCFormer.
  • Figure 4: Hierarchical architecture of DCFormer.
  • Figure 5: DCFormer-based CLIP framework: (a) Training with paired CT volumes and reports, (b) Zero-shot inference with text prompts, (c) Fine-tuning for multi-label classification, and (d) Text-to-image retrieval based on embedding similarity.
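
The retrieval step described in the Figure 5 caption, ranking by embedding similarity, can be sketched as follows. This is a hedged illustration only: it assumes cosine similarity as the similarity measure and nonzero embedding vectors, the function names are hypothetical, and the toy vectors stand in for the outputs of the text and image encoders.

```python
import math


def l2_normalize(vector):
    """Scale a nonzero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two L2-normalized vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def rank_volumes(text_embedding, volume_embeddings):
    """Return volume indices sorted from most to least similar to the text."""
    scored = [
        (cosine_similarity(text_embedding, volume), index)
        for index, volume in enumerate(volume_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored]


if __name__ == "__main__":
    # Toy 2D embeddings; real embeddings would come from the trained encoders.
    report = [1.0, 0.0]
    volumes = [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]]
    print(rank_volumes(report, volumes))
```

Because both vectors are normalized first, the ranking depends only on the angle between embeddings, not on their magnitudes.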