DisentangleFormer: Spatial-Channel Decoupling for Multi-Channel Vision

Jiashu Liao, Pietro Liò, Marc de Kamps, Duygu Sarikaya

TL;DR

DisentangleFormer addresses the entanglement of spatial and channel representations in Vision Transformers with three components: parallel spatial-token and channel-token streams, an adaptive Squeezed Token Enhancer that fuses the two streams, and a multi-scale FFN that injects local context. Motivated by information-theoretic decorrelation, the architecture achieves state-of-the-art results on hyperspectral benchmarks, large-scale remote sensing, and infrared pathology, while reducing FLOPs by 17.8% on ImageNet at competitive accuracy. Together, these components yield decorrelated, information-rich representations for multi-channel vision across diverse domains.

Abstract

Vision Transformers face a fundamental limitation: standard self-attention jointly processes spatial and channel dimensions, leading to entangled representations that prevent independent modeling of structural and semantic dependencies. This problem is especially pronounced in hyperspectral imaging, from satellite remote sensing to infrared pathology, where channels capture distinct biophysical or biochemical cues. We propose DisentangleFormer, an architecture that achieves robust multi-channel vision representation through principled spatial-channel decoupling. Motivated by information-theoretic principles of decorrelated representation learning, our parallel design enables independent modeling of structural and semantic cues while minimizing redundancy between the spatial and channel streams. The design integrates three core components: (1) Parallel Disentanglement, which independently processes spatial-token and channel-token streams, enabling decorrelated feature learning across spatial and spectral dimensions; (2) Squeezed Token Enhancer, an adaptive calibration module that dynamically fuses the spatial and channel streams; and (3) Multi-Scale FFN, which complements global attention with multi-scale local context to capture fine-grained structural and semantic dependencies. Extensive experiments demonstrate that DisentangleFormer achieves state-of-the-art performance, consistently outperforming existing models on the hyperspectral benchmarks Indian Pines, Pavia University, and Houston, on the large-scale BigEarthNet remote sensing dataset, and on an infrared pathology dataset. Moreover, it retains competitive accuracy on ImageNet while reducing computational cost by 17.8% in FLOPs. The code will be made publicly available upon acceptance.
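
The parallel design described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention with random weights, the function names (`self_attention`, `parallel_streams`), and the sigmoid gate are all assumptions made for illustration. The gate is only a hypothetical stand-in for the Squeezed Token Enhancer, whose actual structure is not given here, and the Multi-Scale FFN is omitted. The sketch shows the one point the abstract does state: the same feature map is read once as spatial tokens of shape (HW, C) and once as channel tokens of shape (C, HW), and the two streams are processed independently before fusion.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, rng):
    """Single-head self-attention over the rows of `tokens` (n_tokens, dim).

    Weights are random placeholders; a trained model would learn them.
    """
    _, d = tokens.shape
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d))  # (n_tokens, n_tokens)
    return attn @ v


def parallel_streams(x, rng):
    """x: feature map of shape (C, H, W). Returns both streams and a fusion."""
    c, h, w = x.shape
    channel_in = x.reshape(c, h * w)  # (C, HW): each token is one channel
    spatial_in = channel_in.T         # (HW, C): each token is one pixel

    # The two streams never see each other's attention maps.
    spatial_out = self_attention(spatial_in, rng)    # (HW, C)
    channel_out = self_attention(channel_in, rng).T  # (C, HW) -> (HW, C)

    # Hypothetical fusion (assumption, not the paper's STE): squeeze both
    # streams over tokens, then gate per channel with a sigmoid.
    squeezed = spatial_out.mean(axis=0) + channel_out.mean(axis=0)  # (C,)
    gate = 1.0 / (1.0 + np.exp(-squeezed))
    fused = gate * spatial_out + (1.0 - gate) * channel_out
    return spatial_out, channel_out, fused


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 3))  # toy input: C=4, H=W=3
spatial_out, channel_out, fused = parallel_streams(x, rng)
print(spatial_out.shape, channel_out.shape, fused.shape)
```

Because the gate lies strictly between 0 and 1, the fused output in this sketch is a per-channel convex combination of the two streams; a serial design would instead feed one stream's output into the other.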

Paper Structure

This paper contains 22 sections, 4 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Visual validation of Parallel vs. Serial design on Pavia University. Our parallel DisentangleFormer (Full) produces significantly cleaner classification maps with sharper boundaries and less noise compared to the entangled SerialCTST and SerialSTCT baselines.
  • Figure 2: Visual validation of information disentanglement via canonical correlation analysis (CCA). This figure compares scatter plots of the first canonical correlation for DisentangleFormer against the two serial baselines, SerialSTCT and SerialCTST, across all three HSI datasets.
  • Figure 3: The DisentangleFormer Network Architecture. Input features are processed through an Embedding Layer, then split into parallel Channel Transformer and Spatial Transformer paths. (C, HW) and (HW, C) denote the input dimensions for CT and ST paths respectively. The parallel outputs are fused via the Squeezed Token Enhancer (STE) and processed by the Multi-Scale FFN (MS-FFN). Both transformers employ standard encoder layers with multi-head self-attention. Detailed module structures are provided in the supplementary material.
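
The first canonical correlation referred to in the Figure 2 caption can be computed with a standard QR-based procedure, sketched below. The captions do not say which features are compared or how they are extracted, so the inputs `x` and `y` (two feature matrices with one row per sample) and the function name `first_canonical_correlation` are assumptions for illustration. The sketch also assumes more samples than features and full column rank.

```python
import numpy as np


def first_canonical_correlation(x, y):
    """Largest canonical correlation between x (n, p) and y (n, q).

    Assumes n > max(p, q) and that both centered matrices have full
    column rank.
    """
    # Center each feature so correlations are measured around the mean.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)

    # Orthonormal bases for the column spaces of x and y.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)

    # Singular values of qx^T qy are the canonical correlations,
    # returned in descending order.
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(s[0], 0.0, 1.0))


rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 3))
b = rng.standard_normal((2000, 3))   # independent of a
mix = rng.standard_normal((3, 3))    # random linear map

rho_independent = first_canonical_correlation(a, b)
rho_entangled = first_canonical_correlation(a, a @ mix)
print(rho_independent, rho_entangled)
```

Under this measure, a value near 1 means one feature set is almost a linear function of the other, while a value near 0 indicates little shared linear structure, which is the sense in which lower values suggest better-decorrelated streams.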