XPoint: A Self-Supervised Visual-State-Space based Architecture for Multispectral Image Registration

Ismail Can Yagmur, Hasan F. Ates, Bahadir K. Gunturk

TL;DR

XPoint is a self-supervised, modular image-matching framework designed for adaptive training and fine-tuning on aligned multispectral datasets; users can customize its key components for their specific tasks.

Abstract

Accurate multispectral image matching presents significant challenges due to non-linear intensity variations across spectral modalities, extreme viewpoint changes, and the scarcity of labeled datasets. Current state-of-the-art methods are typically specialized for a single spectral difference, such as visible-infrared, and struggle to adapt to other modalities due to their reliance on expensive supervision, such as depth maps or camera poses. To address the need for rapid adaptation across modalities, we introduce XPoint, a self-supervised, modular image-matching framework designed for adaptive training and fine-tuning on aligned multispectral datasets, allowing users to customize key components based on their specific tasks. XPoint employs modularity and self-supervision to allow for the adjustment of elements such as the base detector, which generates pseudo-ground-truth keypoints invariant to viewpoint and spectrum variations. The framework integrates a VMamba encoder, pretrained on segmentation tasks, for robust feature extraction, and includes three joint decoder heads: two are dedicated to interest point and descriptor extraction, and a task-specific homography regression head imposes geometric constraints for superior performance in tasks like image registration. This flexible architecture enables quick adaptation to a wide range of modalities, demonstrated by training on Optical-Thermal data and fine-tuning on settings such as visual-near infrared, visual-infrared, visual-longwave infrared, and visual-synthetic aperture radar. Experimental results show that XPoint consistently outperforms or matches state-of-the-art methods in feature matching and image registration tasks across five distinct multispectral datasets. Our source code is available at https://github.com/canyagmur/XPoint.

Paper Structure

This paper contains 29 sections, 13 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Architecture of XPoint. (A) The self-supervision stage uses improved multispectral homographic adaptation with spectrum-aware keypoint acceptance and the RIFT2 detector to create multispectral pseudo-ground truth keypoints. (B) The training stage combines pretrained VMamba encoders with SS2D for enhanced feature extraction, incorporating interest point and descriptor decoders and a homography regression head for improved matching and homography estimation. (C) In the inference stage, shared encoders extract features from multispectral pairs, enabling joint keypoint detection, descriptor extraction, and robust outlier removal for accurate correspondences and homography estimation.
  • Figure 2: Random homographies are generated by combining simple transformations such as translation, scaling, rotation, and symmetric perspective distortion, sampled within predefined ranges.
  • Figure 3: Adopted VMamba Encoder Architecture.
  • Figure 4: Interest Point and Descriptor Decoders. The interest point decoder outputs keypoint heatmaps, while the descriptor decoder generates dense descriptors for the input image.
  • Figure 5: Homography Head Architecture.
  • ...and 6 more figures
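
The random homography sampling described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not taken from the XPoint source code. The function names (sample_homography, warp_points), the parameter ranges, and the composition order are assumptions chosen for illustration, and the perspective term is a generic projective row rather than the symmetric variant the caption mentions.

```python
import numpy as np

def sample_homography(rng, max_shift=0.1, scale_range=(0.8, 1.2),
                      max_angle=np.pi / 6, max_persp=0.2):
    """Compose a random 3x3 homography from simple transformations.

    Coordinates are assumed to be normalized (roughly [-1, 1]).
    All ranges are illustrative defaults, not the paper's values.
    """
    # Sample each elementary transformation within its predefined range.
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    s = rng.uniform(*scale_range)
    theta = rng.uniform(-max_angle, max_angle)
    px, py = rng.uniform(-max_persp, max_persp, size=2)

    translation = np.array([[1.0, 0.0, tx],
                            [0.0, 1.0, ty],
                            [0.0, 0.0, 1.0]])
    scaling = np.diag([s, s, 1.0])
    c, sn = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -sn, 0.0],
                         [sn, c, 0.0],
                         [0.0, 0.0, 1.0]])
    # Projective row: a simplified stand-in for perspective distortion.
    perspective = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [px, py, 1.0]])

    # Assumed order: perspective first, then scale, rotate, translate.
    H = translation @ rotation @ scaling @ perspective
    return H / H[2, 2]

def warp_points(H, pts):
    """Apply homography H to an (N, 2) array of points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]                   # back to Cartesian

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = sample_homography(rng)
    corners = np.array([[-0.5, -0.5], [0.5, -0.5], [0.5, 0.5], [-0.5, 0.5]])
    print(warp_points(H, corners))
```

In a homographic-adaptation setting, a detector would be run on many such warped copies of an image and the detections mapped back with the inverse homography (np.linalg.inv(H)) before aggregation. The spectrum-aware acceptance step mentioned in Figure 1 is not modeled here.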