Unsupervised Feature Orthogonalization for Learning Distortion-Invariant Representations

Sebastian Doerrich; Francesco Di Salvo; Christian Ledig

TL;DR

This study introduces unORANIC+, a method that combines unsupervised feature orthogonalization with a Vision Transformer's ability to capture both local and global relationships, improving robustness and generalizability. The results position it as a promising algorithm for advanced medical image analysis.

Abstract

This study introduces unORANIC+, a novel method that integrates unsupervised feature orthogonalization with the ability of a Vision Transformer to capture both local and global relationships for improved robustness and generalizability. The streamlined architecture of unORANIC+ separates anatomical and image-specific attributes, yielding robust and unbiased latent representations that perform well across various medical image analysis tasks and diverse datasets. Extensive experiments demonstrate unORANIC+'s reconstruction proficiency, its resilience to corruptions, and its ability to revise existing image distortions. The model also performs well in downstream tasks such as disease classification and corruption detection. We confirm its adaptability to diverse datasets of varying image sources and sample sizes, which positions the method as a promising algorithm for advanced medical image analysis, particularly in resource-constrained environments lacking large, tailored datasets. The source code is available at https://github.com/sdoerrich97/unoranic-plus.

Paper Structure (13 sections, 6 figures, 4 tables)

Introduction
Related work
Orthogonalization of anatomy and image characteristics
Vision Transformer autoencoder
Method
Baseline
unORANIC+
Training and application
Experiments and results
Reconstruction and corruption revision
Disease classification and corruption detection
Evaluation on higher dimensional datasets
Discussion and Conclusion

Figures (6)

Figure 1: (a) Illustration of domain shifts in terms of different contrasts and brightness levels among manufacturers (i-iii), among models from the same producer (iv-vi), and among the same model at different sites (vii-ix) for the same slice of the same individual across multiple scans. (b) High-level overview of our proposed approach. During training, the encoder $E$ is trained to orthogonalize anatomical and image-characteristic features in an input image (orange path). Once trained, the learned feature orthogonalization by the frozen encoder is used for various downstream tasks, including bias removal, corruption detection and revision in input images, as well as robust, distortion-invariant disease classification (purple path).
Figure 2: Schematic representation of the training pipeline for unORANIC (adapted from Doerrich2024). The input image $I$ is assumed to be bias-free and uncorrupted. Random augmentations $\mathcal{A}_S$, $\mathcal{A}_{v_1}$, and $\mathcal{A}_{v_2}$ distort $I$ to generate synthetic corrupted versions $S$, $V_1$, and $V_2$ with identical anatomical information but different distortions. These distorted images are processed by the shared anatomy encoder $E_A$, which uses the consistency loss $\mathcal{L}_C$ to learn anatomical, distortion-invariant features. Concurrently, $S$ is processed by the characteristic encoder $E_C$ to capture image-specific details such as contrast and brightness. Reconstruction losses $\mathcal{L}_{R_S}$ and $\mathcal{L}_{R_I}$ are applied to the images $\hat{S}$ and $\hat{I}_A$ reconstructed by decoders $D$ and $D_A$, respectively, to ensure that $E_A$ and $E_C$ learn comprehensive, reliable features.
Figure 3: Schematic representation of the training pipeline for the refined unORANIC+ method, using chest X-ray images as an example. The polar arrows illustrate the forward propagation and gradient flow, respectively. During training, an input image $I$ is augmented with a random set of distortions, $\mathcal{A}_S$, to generate the synthetic, distorted image $S$. $S$ is then divided into non-overlapping patches and fed through the single Vision Transformer (ViT) encoder $E$, which maps the input image to a higher-dimensional latent space. Two ViT decoders, $D$ and $D_A$, produce a reconstruction $\hat{S}$ of the synthetic image and a bias-free anatomical reconstruction $\hat{I}_A$, respectively. The two reconstruction losses $\mathcal{L}_{\text{R}_{S}}$ and $\mathcal{L}_{\text{R}_{I}}$ guide the separation of anatomical and image-characteristic features in the latent space and ensure high-quality reconstructions.
Figure 4: Examples from the datasets of the MedMNIST v2 benchmark (Yang2023) used for evaluating our approach (left to right: blood, breast, chest, derma, pneumonia, and retina).
Figure 5: Corruption revision capabilities of unORANIC and unORANIC+. (a) shows their reconstruction consistency despite the corruption-related loss in image quality (PSNR between the original image $I$ and the distorted variant $S$, shown as the green dotted line). (b) highlights the distortion correction capabilities of both methods using Gaussian noise as an example.
...and 1 more figure
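
The quantities named in the captions of Figures 3 and 5 can be illustrated with a minimal, standard-library sketch. It is not taken from the unORANIC+ repository. The function names are hypothetical, and two choices are assumptions that the captions do not specify: mean squared error as the reconstruction loss, and an unweighted sum of $\mathcal{L}_{R_S}$ and $\mathcal{L}_{R_I}$. The sketch shows splitting an image into non-overlapping patches, combining the two reconstruction terms, and computing the PSNR between an original image and a distorted or reconstructed one.

```python
import math


def patchify(image, patch):
    """Split a 2D image (list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, each flattened to a list.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([image[top + r][left + c]
                          for r in range(patch) for c in range(patch)])
    return tiles


def mse(a, b):
    """Mean squared error between two equally sized 2D images."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b)
             for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def total_reconstruction_loss(s_hat, s, i_hat_a, i):
    """Assumed objective: L_RS(S_hat, S) + L_RI(I_hat_A, I), both as MSE.

    The loss type and the equal weighting are illustrative assumptions.
    """
    return mse(s_hat, s) + mse(i_hat_a, i)


def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(reference, test)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    print(patchify(image, 2))  # four flattened 2x2 tiles
    print(psnr([[0.0, 0.0]], [[0.1, 0.1]]))  # roughly 20 dB
```

In this reading, a higher PSNR between $I$ and a reconstruction than between $I$ and $S$ would indicate that part of the distortion has been revised, which is the comparison Figure 5(a) draws.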
