ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images

Nicolas Bourriez; Ihab Bendidi; Ethan Cohen; Gabriel Watkinson; Maxime Sanchez; Guillaume Bollot; Auguste Genovesio

TL;DR

The paper addresses the challenge of heterogeneous bioimaging data where channel count and type vary across experiments, hindering cross-dataset pretraining and unified representation. It introduces ChAda-ViT, a Channel Adaptive Vision Transformer that employs Inter-Channel Attention, token padding, masking, and learnable channel/positional embeddings to map images with arbitrary channel configurations into a fixed-size embedding $\mathbb{R}^K$. Trained in a self-supervised manner on the newly proposed IDRCell100k dataset, ChAda-ViT achieves state-of-the-art performance on several downstream tasks and enables a joint embedding space across different microscopes and channel configurations. The work demonstrates strong transferability, interpretable cross-channel attention maps, and the potential to bridge disparate bioimaging assays, paving the way for cross-experiment analyses and more scalable biological imaging representations.

Abstract

Unlike color photography images, which are consistently encoded into RGB channels, biological images encompass various modalities, where the type of microscopy and the meaning of each channel vary with each experiment. Importantly, the number of channels can range from one to a dozen, and their correlation is often much lower than that of RGB channels, as each of them brings specific information content. This aspect is largely overlooked by methods designed outside the bioimage field, and current solutions mostly focus on intra-channel spatial attention, often ignoring the relationship between channels, which is crucial in most biological applications. Moreover, the variable channel type and count prevent the projection of several experiments to a unified representation for large-scale pre-training. In this study, we propose ChAda-ViT, a novel Channel Adaptive Vision Transformer architecture employing an Inter-Channel Attention mechanism on images with an arbitrary number, order and type of channels. We also introduce IDRCell100k, a bioimage dataset with a rich set of 79 experiments covering 7 microscope modalities, with a multitude of channel types, and counts varying from 1 to 10 per experiment. Our architecture, trained in a self-supervised manner, outperforms existing approaches in several biologically relevant downstream tasks. Additionally, it can be used to bridge the gap, for the first time, between assays with different microscopes, channel numbers or types by embedding various image and experimental modalities into a unified biological image representation. The latter should facilitate interdisciplinary studies and pave the way for better adoption of deep learning in biological image-based analyses. Code and data are available at https://github.com/nicoboou/chadavit.

Paper Structure (17 sections, 15 equations, 16 figures, 3 tables)

Introduction
Related Works
Dataset
Methodology
Problem Formulation
Gold Standard Approach
Our Proposed Approach
Model Architecture
Experiments
Results
Conclusion
Acknowledgments
IDRCell100k Construction
Evaluation Tasks and Datasets
Model details
...and 2 more sections

Figures (16)

Figure 1: Performance comparison on downstream tasks, showcasing ChAda-ViT's superiority in 6 out of 8 tasks compared to existing approaches [microsnoop], using the CLS token only. $R^2$ scores, normalized to 0-100, are presented for the BBBC021 Channel Reconstruction and Nuclear Translocation prediction tasks. This success is attributed to the combined use of Intra-Channel and Inter-Channel Attention. Evaluation on all tokens is detailed in the Appendix.
Figure 2: The ChAda-ViT model architecture, displaying the proposed channel-adaptive embedding process. This figure illustrates the token padding and masking approach for image data with an arbitrary number of channels, the split of channels into patches, and the integration of positional and channel-specific embeddings to reach a fixed-size input. We use the same positional embedding for patches in the same position across channels, and the same channel embedding for all patches of each channel.
Figure 3: Comparison of the last layer self-attention maps between One Channel ViT and ChAda-ViT on IDRCell100k image channels. ChAda-ViT, utilizing Inter-Channel Attention, effectively discerns significant cross-channel correlations (red arrow), focusing on spatially relevant areas in each channel (yellow arrows). This mechanism enables ChAda-ViT to identify critical biological features that might be overlooked with a single-channel focus.
Figure 4: Evaluation of ChAda-ViT's performance in low-data regimes, using 100%, 10%, and 1% of the training data for linear probing. ChAda-ViT's relative performance across tasks is stable: the per-task trends are maintained regardless of the data volume.
Figure 5: UMAP & PCA projections of the BBBC021 (3 channels) and Bray (5 channels) datasets in a unified representation space, derived from ChAda-ViT Moyen. The left UMAP projection highlights the possibility of projecting structurally different experiments into the same space. The right PCA projection and components are labeled by dataset, with a linear separation of the structural differences of different experiments (PC1) and different compounds (PC2) using only the two datasets' common compounds.
...and 11 more figures
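
The channel-adaptive tokenization described in the Figure 2 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `chada_tokens` and `masked_softmax`, the `params` dictionary, the random (untrained) embeddings, the small sizes, and the choice to index channel embeddings by channel slot are all assumptions made for the example. It shows how images with 3 and 5 channels both become a fixed-length token sequence, and how a padding mask keeps padded tokens out of attention.

```python
import numpy as np

def chada_tokens(image, patch, max_channels, params):
    """Turn a (C, H, W) image into a fixed-length token sequence plus padding mask.

    Hypothetical helper: in a trained model the entries of `params` are learned.
    """
    c, h, w = image.shape
    assert c <= max_channels and h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    n = gh * gw  # patches per channel
    # Split every channel independently into non-overlapping patches.
    patches = (image.reshape(c, gh, patch, gw, patch)
                    .transpose(0, 1, 3, 2, 4)
                    .reshape(c, n, patch * patch))
    tok = patches @ params["proj"]           # (c, n, dim) linear patch projection
    tok = tok + params["pos"][None, :, :]    # same positional embedding across channels
    tok = tok + params["chan"][:c, None, :]  # same channel embedding for all patches of a channel
    tok = tok.reshape(c * n, -1)
    # Pad with a shared padding token up to max_channels * n tokens (assumption:
    # padded tokens carry no positional or channel embedding).
    n_pad = (max_channels - c) * n
    tokens = np.concatenate([tok, np.tile(params["pad"], (n_pad, 1))], axis=0)
    pad_mask = np.arange(max_channels * n) >= c * n  # True where the token is padding
    return tokens, pad_mask

def masked_softmax(scores, pad_mask):
    """Attention weights over keys, with padded keys forced to zero weight."""
    scores = np.where(pad_mask[None, :], -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
patch, max_channels, dim, size = 4, 10, 8, 8
n_patches = (size // patch) ** 2
params = {
    "proj": rng.normal(size=(patch * patch, dim)),
    "pos": rng.normal(size=(n_patches, dim)),
    "chan": rng.normal(size=(max_channels, dim)),
    "pad": np.zeros(dim),
}

tokens3, mask3 = chada_tokens(rng.normal(size=(3, size, size)), patch, max_channels, params)
tokens5, mask5 = chada_tokens(rng.normal(size=(5, size, size)), patch, max_channels, params)

# Single-head dot-product attention over the 3-channel sequence: every patch of
# every channel can attend to every other real token, within and across channels.
attn3 = masked_softmax(tokens3 @ tokens3.T / np.sqrt(dim), mask3)
```

Both inputs yield sequences of the same length (`max_channels * n_patches`), so they can be batched together, and the attention weights on padded keys are exactly zero. A CLS token, multiple heads, and learned query/key/value projections are omitted to keep the sketch short.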
