Position-Aware Scene-Appearance Disentanglement for Bidirectional Photoacoustic Microscopy Registration

Yiwen Wang; Jiahao Qin

TL;DR

This paper addresses registration for bidirectional OR-PAM, where forward and backward scans differ in appearance (domain shift) as well as in geometry, which undermines methods built on brightness constancy and deformation fields. It introduces GPEReg-Net, a deformation-free approach that disentangles a scene representation from a global appearance code and reconstructs the registered output via Adaptive Instance Normalization (AdaIN). A Global Position Encoding (GPE) module adds temporal context by fusing learnable and sinusoidal frame embeddings and applying cross-frame attention. On OR-PAM-Reg-4K, the method achieves state-of-the-art SSIM and PSNR with competitive NCC and runs at real-time speed. Temporal-consistency evaluation on a 26K dataset shows strong cross-frame coherence but also exposes the limits of fixed-capacity position embeddings on longer sequences, motivating future work on adaptive encoding and spatially aware appearance modeling for more robust longitudinal imaging.
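As a rough illustration of the AdaIN step named above, the following minimal sketch (plain Python, no deep-learning framework) normalizes each channel of a scene feature map and then re-scales and shifts it with per-channel appearance statistics. It assumes the standard AdaIN formulation; how GPEReg-Net maps its appearance code to the per-channel scale and shift is not stated here, so `style_scale` and `style_shift` are hypothetical inputs, as is the function name `adain`.

```python
import math


def adain(content, style_scale, style_shift, eps=1e-5):
    """Standard AdaIN on a feature map stored as a list of channels.

    content:     list of channels, each a flat list of floats (scene features)
    style_scale: one target standard deviation per channel (hypothetical input)
    style_shift: one target mean per channel (hypothetical input)
    """
    if not (len(content) == len(style_scale) == len(style_shift)):
        raise ValueError("one scale and one shift are required per channel")

    out = []
    for channel, scale, shift in zip(content, style_scale, style_shift):
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Remove the channel's own statistics, then impose the target ones.
        out.append([scale * (x - mean) / std + shift for x in channel])
    return out


# Usage: two channels of four values each, re-styled with new statistics.
scene = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 14.0]]
styled = adain(scene, style_scale=[2.0, 0.1], style_shift=[0.5, -1.0])
```

After the call, each output channel has approximately the requested mean and standard deviation, while the relative spatial pattern within the channel is preserved. That is the sense in which the scene content and the appearance are handled separately.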

Abstract

High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing registration methods, constrained by brightness constancy assumptions, achieve limited alignment quality, while recent generative approaches address domain shift through complex architectures that lack temporal awareness across frames. We propose GPEReg-Net, a scene-appearance disentanglement framework that separates domain-invariant scene features from domain-specific appearance codes via Adaptive Instance Normalization (AdaIN), enabling direct image-to-image registration without explicit deformation field estimation. To exploit temporal structure in sequential acquisitions, we introduce a Global Position Encoding (GPE) module that combines learnable position embeddings with sinusoidal encoding and cross-frame attention, allowing the network to leverage context from neighboring frames for improved temporal coherence. On the OR-PAM-Reg-4K benchmark (432 test samples), GPEReg-Net achieves an NCC of 0.953, an SSIM of 0.932, and a PSNR of 34.49 dB, surpassing the state of the art by 3.8% in SSIM and 1.99 dB in PSNR while maintaining competitive NCC. Code is available at https://github.com/JiahaoQin/GPEReg-Net.
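The two position signals that the GPE module combines can be sketched as follows. This is a minimal illustration in plain Python, assuming the standard Transformer-style sinusoidal formula and a randomly initialized lookup table standing in for the learnable embedding. The embedding width, the base of 10000, the table size, and all function names are assumptions, not values taken from the paper. The fusion MLP and cross-frame attention are omitted; the sketch only builds the vector such an MLP might receive.

```python
import math
import random


def sinusoidal_encoding(position, dim, base=10000.0):
    """Fixed encoding: interleaved sin/cos at geometrically spaced frequencies."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc


def make_learnable_table(max_frames, dim, seed=0):
    """Stand-in for a trainable embedding table (random init, no training here)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(max_frames)]


def fusion_input(frame_index, table, dim):
    """Concatenate the learned and sinusoidal embeddings for one frame."""
    learned = table[frame_index]  # raises IndexError beyond the table size
    fixed = sinusoidal_encoding(frame_index, dim)
    return learned + fixed


# Usage: a 16-frame table with 8-dimensional embeddings.
table = make_learnable_table(max_frames=16, dim=8)
vec = fusion_input(3, table, dim=8)  # 16 values: 8 learned + 8 sinusoidal
```

One design point the sketch makes visible: the sinusoidal part is defined for any frame index, whereas a lookup table only covers the indices it was allocated for.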

Paper Structure (25 sections, 6 equations, 5 figures, 3 tables)

Introduction
Method
Overview
Scene Encoder
Appearance Encoder
Image Decoder with AdaIN Modulation
Global Position Encoding
Learnable position embedding.
Sinusoidal position encoding.
Fusion MLP.
Cross-frame attention.
Position-to-scene projection.
Loss Function
Experiments
Datasets and Implementation Details
...and 10 more sections

Figures (5)

Figure 1: Overview of the GPEReg-Net framework. The SceneEncoder extracts domain-invariant features $\mathbf{s}$ from the moving image using instance normalization and skip connections. The AppearanceEncoder extracts a 32-dimensional global appearance code $\mathbf{a}$ from the fixed image. The GPE module enriches scene features with temporal context from neighboring frames via learnable position embeddings, sinusoidal encoding, and cross-frame attention. The ImageDecoder uses AdaIN to modulate scene features with the target appearance at each decoding level, producing the registered output $\hat{I}_r$.
Figure 2: Illustration of the GPE temporal mechanism. For each frame $t$ in the temporal sequence, the Scene Encoder extracts domain-invariant features $\mathbf{s}_t$. The GPE module enriches $\mathbf{s}_t$ with temporal context from neighboring frames via learnable and sinusoidal position encodings fused through an MLP, followed by multi-head cross-frame attention ($H\!=\!4$, $k\!=\!2$) over a running neighbor cache. The position-aware features $\tilde{\mathbf{s}}_t$ are then decoded with AdaIN modulation to produce temporally coherent registered outputs.
Figure 3: Qualitative comparison of registration results on OR-PAM-Reg-4K. Each column shows one test sample. From top to bottom: fixed image (odd columns), moving image (even columns), and registered outputs from each method. GPEReg-Net produces visually sharper results with better structural preservation compared to deformation-based methods, and comparable quality to SAS-Net.
Figure 4: Distribution of registration metrics on OR-PAM-Reg-4K (432 test samples). Box plots show the median, interquartile range, and outliers for NCC, SSIM, and PSNR across all methods. GPEReg-Net exhibits tight distributions with high medians, confirming consistent registration quality across diverse test samples.
Figure 5: Registration quality visualization on OR-PAM-Reg-Temporal-26K. The first row shows the fixed image (odd columns) as reference. Subsequent rows show registered even columns only (the actual model output) for each method across six consecutive frames. Intra-frame NCC values (registered output vs. fixed) are annotated per cell. Bottom rows show absolute difference maps $|\text{output} - \text{fixed}|$. GPEReg-Net and SAS-Net closely match the fixed image appearance (NCC $\geq$ 0.995), while TransMorph and the unregistered baseline exhibit visible domain shift.
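The NCC values annotated in Figure 5 compare a registered output with the fixed image. A minimal sketch of one common zero-mean definition of normalized cross-correlation follows. The exact variant used in the paper's evaluation is an assumption here, and the function name `ncc` and the flat-list image representation are illustrative only.

```python
import math


def ncc(a, b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two equally sized images.

    Images are given as flat lists of pixel intensities. The result lies in
    [-1, 1]; a flat (constant) image yields 0.0 by convention in this sketch.
    """
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and of equal size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if den < eps:
        return 0.0
    return num / den


# Usage: compare a fixed image with a gain/offset-shifted copy of itself.
fixed = [0.1, 0.4, 0.35, 0.8, 0.6]
shifted = [2.0 * x + 0.3 for x in fixed]
score = ncc(fixed, shifted)
```

Because the means are subtracted and the result is normalized, a purely global gain or offset change leaves this score essentially unchanged. NCC is therefore usually read alongside SSIM and PSNR, which respond to such intensity differences.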
