Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning
Wenyi Lian, Patrick Micke, Joakim Lindblad, Nataša Sladoje
TL;DR
This work tackles the challenge of applying Vision Transformers to multi-channel imaging (MCI) by introducing Isolated Channel ViT (IC-ViT), which pretrains on single channels to learn channel-specific representations and then finetunes on multi-channel data. IC-ViT uses channel-wise patchifying and self-supervised pretraining (DINO) to build robust, scalable foundation models for heterogeneous MCI data, enabling efficient transfer to downstream tasks. Across benchmarks such as JUMP-CP, CHAMMI, and So2Sat-LCZ42, IC-ViT yields consistent gains (4–14 percentage points) over channel-adaptive methods and demonstrates strong robustness to missing channels, with training efficiency that supports large-scale pretraining. The proposed framework, along with extended pretraining on diverse MCI data, positions IC-ViT as a practical pathway toward foundation models for multimodal imaging, with code released for reproducibility.
Abstract
Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as between channels and produces robust feature representations. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4–14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data. Our code is available at https://github.com/shermanlian/IC-ViT.
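To make the idea of channel-wise patchifying concrete, the following is a minimal NumPy sketch, not the released IC-ViT implementation. It assumes an image stored as a (C, H, W) array and a shared linear projection for single-channel patches; the function names, the patch size of 16, and the embedding width of 32 are illustrative assumptions. The sketch contrasts per-channel patches with the standard joint patchifying used for RGB inputs, and omits positional or channel embeddings that a full model would likely need.

```python
import numpy as np


def patchify_per_channel(image, patch):
    """Split a (C, H, W) image into single-channel patches.

    Returns tokens of shape (C * n_patches, patch * patch) and the
    channel index each token came from. Hypothetical helper.
    """
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    # (C, gh, patch, gw, patch) -> (C, gh, gw, patch, patch)
    x = image.reshape(c, gh, patch, gw, patch).transpose(0, 1, 3, 2, 4)
    tokens = x.reshape(c * gh * gw, patch * patch)
    channel_ids = np.repeat(np.arange(c), gh * gw)
    return tokens, channel_ids


def patchify_joint(image, patch):
    """Standard ViT-style patchifying: each token mixes all C channels."""
    c, h, w = image.shape
    gh, gw = h // patch, w // patch
    x = image.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    return x.reshape(gh * gw, c * patch * patch)


rng = np.random.default_rng(0)
patch, dim = 16, 32
# Shared projection for single-channel patches (assumed, for illustration).
proj = rng.standard_normal((patch * patch, dim))

single = rng.standard_normal((1, 64, 64))  # single-channel pretraining input
multi = rng.standard_normal((5, 64, 64))   # multi-channel finetuning input

tok1, ids1 = patchify_per_channel(single, patch)
tok5, ids5 = patchify_per_channel(multi, patch)

emb1 = tok1 @ proj  # 16 tokens, one per spatial patch
emb5 = tok5 @ proj  # 80 tokens, 16 per channel, same projection

# Dropping channels only shortens the token sequence.
tok3, ids3 = patchify_per_channel(multi[[0, 2, 4]], patch)
```

Because the token width depends only on the patch size and not on the channel count, the same projection applies to inputs with one channel or many, which is what allows single-channel pretraining to carry over to multi-channel finetuning in this sketch. With joint patchifying, the token width grows with C, so the projection is tied to a fixed channel set.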
