Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning

Wenyi Lian, Patrick Micke, Joakim Lindblad, Nataša Sladoje

TL;DR

This work tackles the challenge of applying Vision Transformers to multi-channel imaging (MCI) by introducing Isolated Channel ViT (IC-ViT), which pretrains on single channels to learn channel-specific representations and then finetunes on multi-channel data. IC-ViT uses channel-wise patchifying and self-supervised pretraining (DINO) to build robust, scalable foundation models for heterogeneous MCI data, enabling efficient transfer to downstream tasks. Across benchmarks such as JUMP-CP, CHAMMI, and So2Sat, IC-ViT yields consistent gains (4–14 percentage points) over channel-adaptive methods and demonstrates strong robustness to missing channels, with training efficiency that supports large-scale pretraining. The proposed framework, along with extended pretraining on diverse MCI data, positions IC-ViT as a practical pathway toward foundation models for multimodal imaging, with code released for reproducibility.

Abstract

Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as between channels and produces robust feature representations. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4–14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data. Our code is available at https://github.com/shermanlian/IC-ViT.
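
To make the idea of channel-wise patchifying concrete, the following is a minimal NumPy sketch, not the authors' implementation. It contrasts standard patchifying, where all channels at a spatial location are fused into one token, with per-channel patchifying, where every channel contributes its own tokens. The function names, the 8-channel 32×32 toy image, and the patch size of 16 are illustrative assumptions.

```python
import numpy as np

def patchify_joint(image, patch):
    """Standard ViT-style patchifying: one token per location, channels fused."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)          # (gh, gw, c, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)

def patchify_per_channel(image, patch):
    """Channel-wise patchifying: every channel yields its own set of tokens."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(c, gh, patch, gw, patch)
    x = x.transpose(0, 1, 3, 2, 4)          # (c, gh, gw, patch, patch)
    return x.reshape(c * gh * gw, patch * patch)

# Toy 8-channel image (assumed sizes, chosen only for illustration).
image = np.arange(8 * 32 * 32, dtype=np.float32).reshape(8, 32, 32)

joint_tokens = patchify_joint(image, patch=16)
channel_tokens = patchify_per_channel(image, patch=16)
single_tokens = patchify_per_channel(image[:1], patch=16)

print(joint_tokens.shape)    # (4, 2048): 4 tokens, each mixing all 8 channels
print(channel_tokens.shape)  # (32, 256): 4 tokens per channel, 8 channels
print(single_tokens.shape)   # (4, 256): a single channel uses the same token size
```

Because the per-channel token dimension does not depend on the number of channels, the same patch embedding can in principle be applied to a single-channel input during pretraining and to a multi-channel input during finetuning; only the sequence length changes.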

Paper Structure

This paper contains 22 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Left: Visualization of the attention maps generated from the standard Vision Transformer (ViT) (Dosovitskiy et al., 2021), ChannelViT (C-ViT) (Bao et al., 2024), and the proposed Isolated Channel ViT (IC-ViT). Note that ViT compresses all channels into one patch token, while ChannelViT and IC-ViT generate patch tokens for each channel individually. Our IC-ViT tends to extract information from both fluorescence and brightfield channels, while the latter are often ignored by ChannelViT. Right: Performance comparison on the JUMP-CP validation dataset (Chandrasekaran et al., 2024).
  • Figure 2: Inter-channel correlation analysis on the JUMP-CP dataset (Chandrasekaran et al., 2024). Top: Pairwise correlations of channel tokens and features across eight microscopy channels. Bottom: Example of a multi-channel microscopy image and its channel-wise attention maps, obtained from a ViT pretrained with DINO on single-channel inputs.
  • Figure 3: Illustration of channel-wise patchifying in ViT. Each channel is patchified individually, and the resulting patches are embedded into a sequence of vectors, with position and [CLS] embeddings added, to form the input to the transformer. $C_i$ denotes the embeddings of the $i$th channel.
  • Figure 4: (a) Single-channel pretraining. (b) Multi-channel finetuning. IC-ViT samples single-channel images for pretraining and uses all channels for prediction.
  • Figure 5: Attention maps on an eight-channel microscopy image from the JUMP-CP dataset (Chandrasekaran et al., 2024), generated by IC-ViT under different training settings: pretraining only (top), after supervised finetuning (middle), and trained directly with labels without pretraining (bottom).
  • ...and 2 more figures
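
The input construction described in the captions of Figures 3 and 4 can be sketched as follows. This is a hedged illustration under stated assumptions rather than the released code: the parameter names (`proj`, `pos_embed`, `cls_token`), the sizes, the random initialisation, and the choice to share one set of position embeddings across channels are all hypothetical. The block of embedded tokens for channel $i$ plays the role of $C_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, GRID, DIM = 16, 2, 64  # hypothetical patch size, grid side, embedding width

# Hypothetical parameters; in a real model these would be learned.
proj = rng.normal(size=(PATCH * PATCH, DIM)) * 0.02    # shared patch projection
pos_embed = rng.normal(size=(GRID * GRID, DIM)) * 0.02  # assumed shared across channels
cls_token = rng.normal(size=(1, DIM)) * 0.02

def embed_channels(image, channel_ids):
    """Build a transformer input sequence from the selected channels of a (C, H, W) image."""
    blocks = []
    for i in channel_ids:
        patches = (
            image[i]
            .reshape(GRID, PATCH, GRID, PATCH)
            .transpose(0, 2, 1, 3)
            .reshape(GRID * GRID, PATCH * PATCH)
        )
        blocks.append(patches @ proj + pos_embed)  # C_i: embedded tokens of channel i
    return np.concatenate([cls_token] + blocks, axis=0)

image = rng.normal(size=(8, 32, 32))

# Pretraining-style input: one randomly sampled channel.
pretrain_seq = embed_channels(image, [int(rng.integers(8))])
# Finetuning-style input: all channels.
finetune_seq = embed_channels(image, range(8))
# A subset of channels, as when some channels are missing at test time.
partial_seq = embed_channels(image, [0, 2, 5])

print(pretrain_seq.shape, finetune_seq.shape, partial_seq.shape)
```

In this sketch the three inputs differ only in sequence length (one [CLS] token plus four tokens per selected channel), which is what lets the same weights be reused across single-channel, multi-channel, and partial-channel inputs.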