Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers

Chau Pham; Bryan A. Plummer

TL;DR

DiChaViT aims to enhance the diversity of the features learned by MCI-ViT models. It uses a novel channel sampling strategy that encourages the selection of more distinct channel sets for training, and it employs regularization and initialization techniques to increase the likelihood that new information is learned from each channel.

Abstract

Multi-Channel Imaging (MCI) poses an array of challenges for encoding useful feature representations that do not arise with traditional images. For example, images from two different satellites may both contain RGB channels, but the remaining channels can differ between imaging sources. Thus, MCI models must support a variety of channel configurations at test time. Recent work has extended traditional visual encoders, such as Vision Transformers (ViT), to MCI by supplementing pixel information with an encoding that represents the channel configuration. However, these methods treat each channel equally, i.e., they do not consider the unique properties of each channel type, which can result in needless and potentially harmful redundancies in the learned features. For example, if RGB channels are always present, the other channels can focus on extracting information that cannot be captured by the RGB channels. To this end, we propose DiChaViT, which aims to enhance the diversity in the learned features of MCI-ViT models. This is achieved through a novel channel sampling strategy that encourages the selection of more distinct channel sets for training. Additionally, we employ regularization and initialization techniques to increase the likelihood that new information is learned from each channel. Many of our improvements are architecture agnostic and can be incorporated into new architectures as they are developed. Experiments on satellite and cell microscopy datasets (CHAMMI, JUMP-CP, and So2Sat) show that DiChaViT yields a 1.5-5.0% gain over the state-of-the-art. Our code is publicly available at https://github.com/chaudatascience/diverse_channel_vit.
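To make the idea of sampling "more distinct channel sets" concrete, the following is a minimal sketch, not the authors' implementation. This summary does not spell out the sampling procedure, so everything here is an assumption made for illustration: the function name `sample_diverse_channels`, the use of a random anchor channel, cosine similarity between channel embeddings, and the `temperature` parameter.

```python
import numpy as np

def sample_diverse_channels(channel_emb, k, temperature=0.1, rng=None):
    """Pick k channel indices, favouring channels dissimilar to a random anchor.

    channel_emb: (num_channels, dim) array of channel embeddings (assumed input).
    All design choices here are illustrative assumptions, not the paper's method.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_channels = channel_emb.shape[0]
    if not 1 <= k <= num_channels:
        raise ValueError("k must be between 1 and the number of channels")

    # L2-normalise rows so that dot products are cosine similarities.
    unit = channel_emb / np.linalg.norm(channel_emb, axis=1, keepdims=True)

    # Choose the anchor channel uniformly at random.
    anchor = int(rng.integers(num_channels))
    if k == 1:
        return np.array([anchor])

    # Similarity of every other channel to the anchor.
    others = np.delete(np.arange(num_channels), anchor)
    sim = unit[others] @ unit[anchor]

    # Softmax over negative similarity: less similar channels get more mass.
    logits = -sim / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()

    picked = rng.choice(others, size=k - 1, replace=False, p=probs)
    return np.concatenate(([anchor], picked))
```

A lower `temperature` concentrates the probability mass on the channels least similar to the anchor, while a high value approaches uniform sampling. Very small temperatures can underflow some probabilities to zero, in which case sampling without replacement may fail for large `k`.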

Paper Structure (18 sections, 5 equations, 7 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Encouraging Diverse Representations in multi-channel ViTs
Enhancing channel token separation
Enhancing feature diversity for patch tokens
Diverse Channel Sampling (DCS)
Training Objective
Experiments
Experimental Setup
Datasets
Results
Analysis and Discussion
Role of Channel Tokens in MCI-ViT Models
Ablation on Feature Diversification Losses (CDL and TDL)
Ablations for Diverse Channel Sampling (DCS)
...and 3 more sections

Figures (7)

Figure 1: Comparison of the redundant information learned by different models on the HPA dataset in CHAMMI (chen2023chammi). (a) Mutual information between the channel tokens, which capture the configuration of channels in an image. The diagonal is grayed out for better visualization. ChannelViT tokens have high mutual information, which suggests that significant redundancy exists across channels (zhou2022mutual; zhang2021rgb). In contrast, DiChaViT tokens have little mutual information, as each channel is encouraged to learn different features. (b) Attention scores of the $\mathrm{[CLS]}$ token to the patch tokens in the penultimate layers, aggregated by channel. ChannelViT (top) relies on certain channels (e.g., microtubules and nucleus) to make predictions and less on other channels (e.g., protein and er). In contrast, DiChaViT shows more evenly distributed attention scores across channels, suggesting that each channel contributes more to the model's predictions.
Figure 2: An overview of DiChaViT. We introduce two regularization methods on the features and a channel sampling strategy to promote diversity in feature representations. We apply (a) Channel Diversification Loss (CDL) (see "Enhancing channel token separation") to the channel tokens, and (b) Token Diversification Loss (TDL) (see "Enhancing feature diversity for patch tokens") to the patch tokens. Additionally, we (c) sample a subset of dissimilar channels using Diverse Channel Sampling (DCS) (see "Diverse Channel Sampling (DCS)").
Figure 3: Performance of DiChaViT on JUMP-CP and CHAMMI with and without channel tokens. Using channel tokens with orthogonal initialization (green) improves performance.
Figure 4: Impact of CDL and TDL on DiChaViT's performance. (a) and (b) Average top-1 test accuracy and standard deviation over three runs for different values of $\lambda_{\mathrm{CDL}}$ on So2Sat and CHAMMI. (c) Performance with different ratios of $\lambda_d$ and $\lambda_s$ in TDL on So2Sat.
Figure 5: Comparison of DCS and HCS (bao2024channel) in terms of how often (%) each channel is sampled during training on So2Sat. Unlike HCS, which samples all channels uniformly (red dashed line), DCS trains some channels much more often than others (blue bars). For example, the Real Lee-Cov channel (rightmost) is sampled twice as often as Band B8a (first bar).
...and 2 more figures
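The per-channel aggregation of $\mathrm{[CLS]}$ attention described in the Figure 1(b) caption can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: the function name `cls_attention_per_channel` is hypothetical, the token layout is assumed to be one $\mathrm{[CLS]}$ token followed by patch tokens grouped channel by channel, and averaging over heads followed by normalization is one possible choice that the caption does not confirm.

```python
import numpy as np

def cls_attention_per_channel(attn, num_channels):
    """Aggregate [CLS]-to-patch attention by channel.

    attn: (heads, tokens, tokens) attention weights from a single layer.
    Assumed layout: token 0 is [CLS]; the remaining tokens are patch tokens
    grouped by channel, with the same number of patches per channel.
    """
    # Attention from [CLS] to every patch token, averaged over heads.
    cls_to_patches = attn[:, 0, 1:].mean(axis=0)
    if cls_to_patches.size % num_channels != 0:
        raise ValueError("patch tokens do not divide evenly across channels")

    # Sum the attention mass that falls on each channel's patches.
    per_channel = cls_to_patches.reshape(num_channels, -1).sum(axis=1)
    total = per_channel.sum()
    if total == 0:
        raise ValueError("[CLS] token assigns no attention to patch tokens")

    # Normalise so that the channel shares sum to 1.
    return per_channel / total
```

Under this sketch, a model that spreads its attention evenly would return roughly `1 / num_channels` for every channel, while a model that relies on a few channels would return a skewed vector.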
