SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning

Jiayi Wang, Hadrien Reynaud, Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Bjoern Menze, Bernhard Kainz

TL;DR

This work uses Rotary Position Embedding as the positional encoding method. It is applied directly within the attention operation, generates input-conditioned sine and cosine weights on the fly, ensures consistent alignment between query and key projections, and adapts to any input size.

Abstract

Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimension. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input size. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.
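
The rotary mechanism described in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the function names, the frequency base of 10000, the pairing of adjacent features, and the use of a single position index per chunk along the z-axis are all assumptions made for illustration. The sketch shows the two properties the abstract relies on: the sine and cosine weights are computed from the positions of whatever input arrives, so no fixed-size embedding table is needed, and the query-key score depends only on the relative offset between chunks.

```python
import math


def rope_angles(position, head_dim, base=10000.0):
    """Rotation angle for each feature pair at a given chunk position.

    `base` is an assumed default; the paper's value is not stated here.
    """
    return [position / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)  # computed on the fly
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_scores(queries, keys):
    """Scaled attention logits with the same rotation applied to queries and keys.

    Works for any number of chunks because positions come from the input itself.
    """
    rotated_q = [apply_rope(q, m) for m, q in enumerate(queries)]
    rotated_k = [apply_rope(k, n) for n, k in enumerate(keys)]
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [[dot(q, k) * scale for k in rotated_k] for q in rotated_q]


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0, 0.25]
    k = [1.5, 0.5, -0.75, 1.0]
    # Same relative offset (2 chunks apart) at very different absolute positions.
    near = dot(apply_rope(q, 3), apply_rope(k, 1))
    far = dot(apply_rope(q, 103), apply_rope(k, 101))
    print(abs(near - far) < 1e-9)
    # Two hypothetical studies with different numbers of chunks along z.
    for num_chunks in (3, 11):
        scores = attention_scores([q] * num_chunks, [k] * num_chunks)
        print(num_chunks, len(scores), len(scores[0]))
```

Because the rotation is applied identically to queries and keys, shifting both positions by the same amount leaves the score unchanged, which is why a study with more slices can be processed without interpolating or cropping to a fixed length.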

Paper Structure

This paper contains 13 sections, 8 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Radar chart of average precision for organ-presence classification using linear probes: comparison between SigVLP (ours), CT-CLIP, and DINOv3 variants (Base, Large) with identical linear-probe structures.
  • Figure 2: Our approach for organ-wise radiology report generation: The original CT volume is segmented into organ masks [wasserthal2023totalsegmentator, xu2025cads]. The volume is then split into blocks of different lengths, where the masks indicate which organs should be included for report generation. Organ-specific findings are extracted with GPT-5 mini to construct an organ findings bank, stored as individual entries. A general description is appended to summarize the entire volume.
  • Figure 3: UMAP [mcinnes2018umap] visualization of evaluation embeddings across baselines and our model. Colors indicate abnormality classes, with similar hues corresponding to semantically related labels. Top row: DINOv3-base, CT-CLIP, CT-Vocab (vocabulary-finetuned), CT-LiPro (classification-finetuned). Bottom row: Our method at 2k, 4k, 6k, and 234,930 training steps.
  • Figure 4: Qualitative comparison of segmentation overlays across four slices (125, 250, 390, 460) in an example volume. Columns correspond to Ground Truth, DINOv3-base, and our method. Rows correspond to different slices, with vertically centered rotated labels for compact visualization.
  • Figure 5: Linear-probe classification F1 score vs. number of slices. Dashed line: DINOv3-Base; solid line: SigVLP; point size shows the required floating-point operations (FLOPs).
  • ...and 6 more figures