MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models

Wontae Choi, Jaelin Lee, Hyung Sup Yun, Byeungwoo Jeon, Il Yong Chun

TL;DR

MoTDiff is a high-resolution motion trajectory estimator that operates directly on a single motion-blurred image using a conditional diffusion framework. Multi-scale features from a Pyramid Vision Transformer (PVT) are combined through a stepwise feature aggregation strategy and used to condition a lightweight diffusion denoiser, which produces a dense $256\times256$ motion trajectory map. Training combines a weighted binary cross-entropy (BCE) loss and an intersection-over-union (IoU) loss with a connectivity-promoting STPD method. The approach reports state-of-the-art results in blind image deblurring and coded exposure photography (CEP) on synthetic GoPro-derived data and real RSBlur images, and ablations validate the contributions of multi-scale conditioning, the new loss, and STPD. The higher-fidelity motion representation enables more accurate point spread function (PSF) modeling and more effective code optimization in CEP, with end-to-end task integration left as potential future work.

Abstract

Accurate estimation of motion information is crucial in diverse computational imaging and computer vision applications. Researchers have investigated various methods to extract motion information from a single blurred image, including blur kernels and optical flow. However, existing motion representations are often of low quality, i.e., coarse-grained and inaccurate. In this paper, we propose the first high-resolution (HR) Motion Trajectory estimation framework using Diffusion models (MoTDiff). Unlike existing motion representations, MoTDiff aims to estimate a high-quality HR motion trajectory from a single motion-blurred image. The proposed MoTDiff consists of two key components: 1) a new conditional diffusion framework that uses multi-scale feature maps extracted from a single blurred image as a condition, and 2) a new training method that promotes precise identification of a fine-grained motion trajectory, consistent estimation of the overall shape and position of a motion path, and pixel connectivity along a motion trajectory. Our experiments demonstrate that the proposed MoTDiff outperforms state-of-the-art methods in both blind image deblurring and coded exposure photography applications.
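
The TL;DR states that training combines a weighted BCE loss with an IoU loss. The exact weighting scheme and the STPD step are not given in this summary, so the following is only a minimal NumPy sketch of what such a combined loss could look like. The function name `weighted_bce_iou_loss`, the `pos_weight` heuristic (ratio of background to trajectory pixels), and the unweighted sum of the two terms are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def weighted_bce_iou_loss(pred, target, pos_weight=None, eps=1e-7):
    """Hypothetical combined loss on a trajectory probability map.

    pred   : array of probabilities in [0, 1], e.g. shape (256, 256)
    target : binary ground-truth trajectory map of the same shape
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    if pos_weight is None:
        # Assumed heuristic: trajectory pixels are rare, so up-weight them
        # by the background-to-trajectory pixel ratio.
        n_pos = target.sum()
        pos_weight = (target.size - n_pos) / max(n_pos, 1.0)

    # Weighted binary cross-entropy, averaged over all pixels.
    bce = -(pos_weight * target * np.log(pred)
            + (1.0 - target) * np.log(1.0 - pred)).mean()

    # Soft IoU loss: 1 - |intersection| / |union| on probabilities.
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    iou_loss = 1.0 - (inter + eps) / (union + eps)

    return bce + iou_loss

# Toy example: a one-pixel-wide diagonal trajectory on a 256x256 grid.
target = np.eye(256, dtype=np.float64)
loss_perfect = weighted_bce_iou_loss(target.copy(), target)
loss_empty = weighted_bce_iou_loss(np.zeros_like(target), target)
print(loss_perfect < loss_empty)
```

In this sketch the BCE term penalizes per-pixel errors (with thin trajectories up-weighted), while the IoU term scores the overlap of the whole predicted path, which loosely mirrors the abstract's distinction between fine-grained identification and overall shape and position.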

Paper Structure

This paper contains 23 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustrations of different motion trajectory representations for the same 2D motion path. (a) A set of trajectory positions [trajmodeling] (continuous space). (b) PSF [blind-dps, kernel-diff] (discrete space with $64\times64$ pixels). (c) Parametric trajectory [ETR] (quadratic curve constraint; continuous space). (d) Proposed HR trajectory (discrete space with $256\times256$ pixels).
  • Figure 2: Overview of the proposed MoTDiff in the reverse diffusion process. To extract a condition in MoTDiff, we first extract multi-scale feature maps $\{ \mathbf{f}_s \}$ from a single blurred image $\mathbf{b}$ using PVT. We then enhance salient motion features in $\{ \mathbf{f}_s \}$ (local emphasis), and progressively integrate global and local motion features $\{ \mathbf{f}^{\text{up}}_s \}$ across the feature hierarchy (stepwise feature aggregation). We use the aggregated feature map $\mathbf{z}_1$ as a condition for a diffusion denoiser that gives a motion trajectory estimate $\hat{\mathbf{x}}_0$ from a noisy trajectory $\mathbf{x}_t$ at sampling timestep $t$, $\forall t$. We train MoTDiff using loss functions that compare an estimated trajectory $\hat{\mathbf{x}}_0$ with the ground truth $\mathbf{x}_0$, for timesteps sampled uniformly at random.
  • Figure 3: Visualizations of motion features with different levels of understanding captured via PVT (PVT Stages 1 & 4). (a) An input motion-blurred image to the PVT encoder ($256\times256$ pixels). (b) Ground-truth HR motion trajectory ($256\times256$ pixels). (c) A low-level motion feature from PVT Stage 1 ($64\times64$ pixels; LE applied). (d) A high-level motion feature from PVT Stage 4 ($8\times8$ pixels; LE applied).
  • Figure 4: Comparisons of deblurred images and estimated PSFs from different blind image deblurring methods (the inset in the top-left corner displays the ground-truth or estimated PSF; we used the synthetic GoPro dataset described in the paper's data section). The proposed MoTDiff gives significantly better motion trajectories and deblurred images than several SOTA blind image deblurring methods.
  • Figure 5: Comparisons of deblurred images and estimated PSFs from different blind image deblurring methods (the inset in the top-left corner displays the estimated PSF; we used the real-world RSBlur dataset described in the paper's data section).
  • ...and 1 more figures
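
The Figure 2 and Figure 3 captions describe a feature pyramid that runs from $64\times64$ (PVT Stage 1) down to $8\times8$ (PVT Stage 4) for a $256\times256$ input, merged coarse-to-fine into a single condition $\mathbf{z}_1$. The captions do not specify the merge operator, so the sketch below is only one plausible reading: nearest-neighbour $2\times$ upsampling followed by element-wise addition. The helper names, the intermediate $32\times32$ and $16\times16$ stage sizes, and the shared channel count across stages are all assumptions for illustration (a real PVT has different channel widths per stage and would need a projection first).

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def stepwise_aggregate(feats):
    """Merge feature maps coarse-to-fine.

    feats: list of (C, H, W) arrays ordered fine -> coarse, where each map
           has half the spatial resolution of the previous one.
    Returns a map at the finest resolution (a stand-in for z_1).
    """
    z = feats[-1]                      # start from the coarsest (global) map
    for f in reversed(feats[:-1]):     # walk towards finer (local) maps
        z = f + upsample2x(z)          # assumed merge: upsample, then add
    return z

# Toy pyramid for a 256x256 input: 64, 32, 16, 8 pixels per side.
rng = np.random.default_rng(0)
channels = 8                           # hypothetical shared channel count
feats = [rng.standard_normal((channels, s, s)) for s in (64, 32, 16, 8)]
z1 = stepwise_aggregate(feats)
print(z1.shape)
```

Under these assumptions the aggregated map keeps the finest ($64\times64$) resolution while every location also receives information from the coarser, more global stages.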