Cross-attention learning enables real-time nonuniform rotational distortion correction in OCT
Haoran Zhang, Jianlong Yang, Jingqian Zhang, Shiqing Zhao, Aili Zhang
TL;DR
This work addresses the real-time correction of nonuniform rotational distortion (NURD) in endoscopic OCT by introducing a stacked cross-attention network that models long-range dependencies between OCT A-lines. Framed as a self-supervised problem, the method predicts bi-directional A-line distortions and uses a cumulative transform to correct frames, guided by a combined loss with $\mathcal{L}_1$, $\mathcal{L}_{sm}$, and $\mathcal{L}_{si}$. Across synthetic and real datasets, it achieves superior correction accuracy and robustness while delivering real-time performance (~$26\pm3$ fps), outperforming feature-based, DP, and CNN-based baselines. The approach demonstrates strong potential for real-time OCT applications in surgical navigation and functional imaging, with discussions on data requirements and domain generalization.
Abstract
Nonuniform rotational distortion (NURD) correction is vital for endoscopic optical coherence tomography (OCT) imaging and its functional extensions, such as angiography and elastography. Current NURD correction methods require time-consuming feature tracking or cross-correlation calculations and thus sacrifice temporal resolution. Here we propose a cross-attention learning method for NURD correction in OCT. Our method is inspired by the recent success of the self-attention mechanism in natural language processing and computer vision. By leveraging its ability to model long-range dependencies, we can directly obtain the correlation between OCT A-lines at any distance, thus accelerating NURD correction. We develop an end-to-end stacked cross-attention network and design three types of optimization constraints. We compare our method with two traditional feature-based methods and a CNN-based method on two publicly available endoscopic OCT datasets and a private dataset collected on our home-built endoscopic OCT system. Our method achieved a $\sim3\times$ speedup, reaching real-time operation ($26\pm 3$ fps), together with superior correction performance.
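To illustrate how attention can relate A-lines at any distance in a single step, the following is a minimal sketch of scaled dot-product cross-attention between the A-lines of two frames. It is not the stacked network described above: the function name `cross_attention`, the use of raw A-line intensities as queries and keys (no learned projections), and the random test data are all assumptions made for illustration.

```python
import numpy as np

def cross_attention(frame_a, frame_b):
    """Scaled dot-product cross-attention between the A-lines of two frames.

    frame_a, frame_b: arrays of shape (n_alines, depth), one row per A-line.
    Returns an (n_alines, n_alines) weight matrix; row i is a softmax
    distribution over the A-lines of frame_b for A-line i of frame_a.

    Assumption: raw A-lines serve directly as queries and keys. A trained
    network would apply learned projections first.
    """
    depth = frame_a.shape[1]
    # Similarity of every A-line in frame_a to every A-line in frame_b
    scores = frame_a @ frame_b.T / np.sqrt(depth)
    # Numerically stable softmax over the A-lines of frame_b
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical data: a random frame and a copy rotated by 5 A-lines
rng = np.random.default_rng(0)
frame_a = rng.standard_normal((32, 256))
frame_b = np.roll(frame_a, 5, axis=0)

weights = cross_attention(frame_a, frame_b)
# Index of the best-matching A-line in frame_b for each A-line in frame_a
best_match = weights.argmax(axis=1)
```

In this toy case every A-line of `frame_a` attends most strongly to its rotated counterpart in `frame_b`, regardless of how far apart the two indices are. One matrix product yields all pairwise correlations, which is the property the abstract credits for avoiding explicit feature tracking or cross-correlation search.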
