Motion-Corrected Moving Average: Including Post-Hoc Temporal Information for Improved Video Segmentation

Robert Mendel, Tobias Rueckert, Dirk Wilhelm, Daniel Rueckert, Christoph Palm

TL;DR

Motion-Corrected Moving Average (MCMA) introduces temporal information into video segmentation at inference time without modifying training or architecture. It aligns past feature representations to the current frame by optical-flow-based warping and combines them with the current features in a feature-space exponential moving average, which reduces inter-frame noise while preserving real-time performance. Across Barrett, EndoVis-2019, Cholec, and Cityscapes, MCMA yields consistent IoU gains over single-frame and EMA baselines, with notable improvements on high-motion frames. Runtime overhead stays modest because the optical flow can be computed in parallel with the segmentation model and at reduced resolution. The approach is flexible, hardware-friendly, and applicable to diverse segmentation tasks; future work focuses on adaptive per-pixel smoothing and dynamic motion scaling.

Abstract

Real-time computational speed and a high degree of precision are requirements for computer-assisted interventions. Applying a segmentation network to a medical video processing task can introduce significant inter-frame prediction noise. Existing approaches can reduce inconsistencies by including temporal information but often impose requirements on the architecture or dataset. This paper proposes a method to include temporal information in any segmentation model and, thus, a technique to improve video segmentation performance without alterations during training or additional labeling. With Motion-Corrected Moving Average, we refine the exponential moving average between the current and previous predictions. Using optical flow to estimate the movement between consecutive frames, we can shift the prior term in the moving-average calculation to align with the geometry of the current frame. The optical flow calculation does not require the output of the model and can therefore be performed in parallel, leading to no significant runtime penalty for our approach. We evaluate our approach on two publicly available segmentation datasets and two proprietary endoscopic datasets and show improvements over a baseline approach.
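The update described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the update has the form $s_t = \alpha x_t + (1-\alpha)\,\mathcal{W}(s_{t-1}, F_t)$, where $x_t$ is the current prediction or feature map, $\mathcal{W}$ warps the previous state with the flow field $F_t$, and $\alpha$ weights the current frame. The function names `warp_backward` and `mcma_step`, the `(dy, dx)` flow layout, and the nearest-neighbour sampling are all assumptions made for brevity.

```python
import numpy as np

def warp_backward(prev, flow):
    """Warp `prev` (H, W, C) into the geometry of the current frame.

    Assumption: flow[y, x] = (dy, dx) points from a pixel of the current
    frame to its source location in the previous frame. Nearest-neighbour
    sampling with border clamping keeps the sketch short.
    """
    h, w = prev.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return prev[src_y, src_x]

def mcma_step(current, prev_state, flow, alpha):
    """One motion-corrected moving-average update (assumed form).

    alpha weights the current frame; (1 - alpha) weights the warped history.
    On the first frame there is no history, so the current map is returned.
    """
    if prev_state is None:
        return current
    return alpha * current + (1.0 - alpha) * warp_backward(prev_state, flow)
```

Because the flow depends only on the two input images, it could be estimated while the segmentation model processes the current frame; `mcma_step` would then run once both results are available.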

Paper Structure (23 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Inference with MCMA. Features extracted by the encoder on the current frame, the estimated displacement field, and the results of the previous iteration are merged in the MCMA module. The result is passed to the decoder layer to predict the segmentation and as input for the MCMA calculation in the next frame.
  • Figure 2: Example source images with corresponding segmentation masks from the EndoVis-2019 (Ross et al., 2021) and Cholec datasets.
  • Figure 3: Comparison between the baseline, EMA, and MCMA on consecutive frames of the Barrett dataset. The image-based segmentation can track the non-dysplastic Barrett's (blue) frame by frame but introduces visual noise. With a low $\alpha$ value, EMA loses the segmentation during fast movements. MCMA, on the other hand, suppresses visual noise while accurately tracking the relevant region.
  • Figure 4: Effects of changing $\alpha$ for EMA and MCMA on the Barrett dataset. With a slow moving average, MCMA can achieve the most accurate results, while the quality with just EMA deteriorates. The more focus is put on the recent frames, the smaller the difference between all three approaches becomes.
  • Figure 5: Runtime performance of MCMA, segmented into the optical flow component and the warping and EMA calculation (eq. (\ref{eq:mcma})) for varying resolution scales. The computation time for both parts benefits from a lower resolution. Results were obtained on an Nvidia RTX 3090.
  • ...and 1 more figure
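The qualitative behaviour described in the captions of Figures 3 and 4 can be reproduced on a toy problem. The sketch below is an illustration under strong assumptions, not an experiment from the paper: the "video" is a 1D strip, the object moves at a known constant speed so that `np.roll` stands in for optical-flow warping, per-frame scores are the ground truth plus Gaussian noise, and $\alpha$ weights the current frame. All names and parameter values are hypothetical.

```python
import numpy as np

def iou(pred, truth):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 1.0

def simulate(alpha=0.2, speed=3, width=4, length=64, frames=20,
             noise=0.3, seed=0):
    """Compare per-frame, EMA, and motion-corrected EMA on a moving object."""
    rng = np.random.default_rng(seed)
    ema = mcma = None
    scores = {"baseline": [], "ema": [], "mcma": []}
    for t in range(frames):
        truth = np.zeros(length, dtype=bool)
        truth[t * speed : t * speed + width] = True
        # Noisy single-frame scores stand in for the network output.
        frame = truth + rng.normal(0.0, noise, length)
        if t == 0:
            ema = mcma = frame
        else:
            ema = alpha * frame + (1 - alpha) * ema
            # Shift the history by the known motion before averaging.
            mcma = alpha * frame + (1 - alpha) * np.roll(mcma, speed)
        for name, state in (("baseline", frame), ("ema", ema), ("mcma", mcma)):
            scores[name].append(iou(state > 0.5, truth))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}

mean_iou = simulate()
print(mean_iou)
```

With a low $\alpha$ and fast motion, the plain EMA state never accumulates enough evidence at the object's new position, while the motion-corrected variant averages noise along the object's trajectory. In a real pipeline the shift would come from an estimated flow field and would be imperfect, so the gap would likely be smaller than in this idealised setting.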