Robust Self-Supervised Cross-Modal Super-Resolution against Real-World Misaligned Observations

Xiaoyu Dong, Jiahuan Li, Ziteng Cui, Naoto Yokoya

TL;DR

This work proposes RobSelf, a fully self-supervised model that is optimized online and requires no training data, ground-truth supervision, or pre-alignment, while achieving state-of-the-art performance and superior efficiency.

Abstract

We study cross-modal super-resolution (SR) on real-world misaligned data, where only a limited number of low-resolution (LR) source and high-resolution (HR) guide image pairs with complex spatial misalignments are available. To address this challenge, we propose RobSelf, a fully self-supervised model that is optimized online, requiring no training data, ground-truth supervision, or pre-alignment. RobSelf features two key techniques: a misalignment-aware feature translator and a content-aware reference filter. The translator reformulates unsupervised cross-modal and cross-resolution alignment as a weakly-supervised, misalignment-aware translation subtask, producing an aligned guide feature with inherent redundancy. Guided by this feature, the filter performs reference-based discriminative self-enhancement on the source, enabling SR predictions with high resolution and high fidelity. Across a variety of tasks, we demonstrate that RobSelf achieves state-of-the-art performance and superior efficiency. Additionally, we introduce a real-world dataset, RealMisSR, to advance research on this topic. Dataset and code: https://github.com/palmdong/RobSelf.

Paper Structure (18 sections, 6 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Real-world misaligned RGB-guided depth SR ($\times 4$). Our model achieves state-of-the-art performance without the need for training data, ground-truth supervision, or pre-alignment. (a) LR source; (b) HR guide; (c) guide pre-aligned by MINIMA [MINIMA_2025_CVPR]; (d) SSGNet [SSGNet_2023_AAAI] + pre-alignment; (e) SGNet [SGNet_2024_AAAI] + pre-alignment; (f) RobSelf-Re (Ours).
  • Figure 2: RobSelf is supervised solely by the LR source. Within this framework, the translator is weakly supervised to translate the guide feature into an HR prediction that mimics the source modality, while deriving an aligned guide feature. Guided by this feature, the filter performs reference-based discriminative self-enhancement on the source feature, from which the SR prediction is generated. Dashed flows exist only during optimization. RobSelf has two variants depending on the alignment layer of the translator (see the subsection on the translator).
  • Figure 3: Misalignment-aware feature translator. Each encoder layer downsamples by $\times 2$; each decoder layer upsamples by $\times 2$.
  • Figure 4: Content-aware reference filter.
  • Figure 5: Example data from our RealMisSR dataset. LR sources are overlaid on HR guides for better visualization of misalignments.
  • ...and 4 more figures