DC-ViT: Modulating Spatial and Channel Interactions for Multi-Channel Images

Umar Marikkar; Syed Sameed Husain; Muhammad Awais; Sara Atito

Abstract

Training and evaluation in multi-channel imaging (MCI) remain challenging due to heterogeneous channel configurations arising from varying staining protocols, sensor types, and acquisition settings. This heterogeneity limits the applicability of the fixed-channel encoders commonly used in general computer vision. Recent Multi-Channel Vision Transformers (MC-ViTs) address this by accepting flexible channel inputs, typically by jointly encoding patch tokens from all channels within a unified attention space. However, unrestricted token interactions across channels can lead to feature dilution, weakening the channel-specific semantics that are critical in MCI data. To address this, we propose the Decoupled Vision Transformer (DC-ViT), which explicitly regulates information sharing through Decoupled Self-Attention (DSA). DSA decomposes token updates into two complementary pathways: spatial updates that model intra-channel structure, and channel-wise updates that adaptively integrate cross-channel information. This decoupling mitigates informational collapse while still allowing selective inter-channel interaction. To further exploit the resulting channel-specific representations, we introduce Decoupled Aggregation (DAG), which allows the model to learn task-specific channel importances. Extensive experiments across three MCI benchmarks demonstrate consistent improvements over existing MC-ViT approaches.
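To make the two pathways concrete, the following is a minimal NumPy sketch of a decoupled token update, written from the abstract alone and not taken from the paper's DSA pseudo-code. It assumes tokens arranged as a `(C, N, D)` array (channels, patches per channel, embedding size), uses identity query/key/value projections for brevity, and combines the pathways by plain residual addition; the function names `decoupled_update` and `aggregate_channels`, the `use_channel` flag, and the softmax-weighted aggregation are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    # Leading axes are treated as independent batches.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def decoupled_update(x, use_channel=True):
    """Illustrative decoupled update for tokens x of shape (C, N, D).

    Assumption: identity Q/K/V projections and additive residuals;
    the paper's DSA block may differ in every one of these choices.
    """
    # Spatial pathway: each channel attends only over its own N patches.
    out = x + attention(x, x, x)
    if use_channel:
        # Channel pathway: each patch position attends over the C channels.
        xc = np.swapaxes(x, 0, 1)  # (N, C, D)
        out = out + np.swapaxes(attention(xc, xc, xc), 0, 1)
    return out


def aggregate_channels(feats, logits):
    # Hypothetical stand-in for learned channel importances:
    # softmax-weighted sum of per-channel features, (C, D) -> (D,).
    return softmax(logits) @ feats


rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4, 8))          # 3 channels, 4 patches, dim 8
spatial_only = decoupled_update(tokens, use_channel=False)
spatial_and_channel = decoupled_update(tokens, use_channel=True)
pooled = aggregate_channels(spatial_and_channel.mean(axis=1), np.zeros(3))
```

With `use_channel=False`, the output for one channel does not depend on any other channel's tokens, which is the sense in which the spatial pathway preserves channel-specific structure; enabling the channel pathway reintroduces cross-channel mixing only along the channel axis at each patch position.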

Paper Structure (16 sections, 11 equations, 6 figures, 8 tables)

Introduction
Background and Related Work
Methodology
Preliminaries: The MC-ViT encoding protocol
Decoupled Vision Transformer (DC-ViT)
Experiments
Results
MCI benchmarks
Analysis
Conclusion
Further implementation details
Hyperparameter settings
DSA pseudo-code
Additional results
Comparison vs. ChaMAEViT under random patch sampling
...and 1 more sections

Figures (6)

Figure 1: DC-ViT consistently improves existing MC-ViTs across MCI benchmarks. The y-axis is the average score over CHAMMI, JUMP-CP and So2Sat downstream tasks.
Figure 2: Schematic of DC-ViT. Channels build strong low-level representations independently, and are then introduced to cross-channel interactions. DSA(sp) and DSA(sp+ch) denote DC-ViT blocks (\ref{eq:dcvit}) with $l\notin \mathcal{M}$ and $l\in \mathcal{M}$ in \ref{DSA_def}, respectively.
Figure 3:
Figure 4:
Figure 5:
...and 1 more figures
