
A Practical Gated Recurrent Transformer Network Incorporating Multiple Fusions for Video Denoising

Kai Guo, Seungwon Choi, Jongseong Choi, Lae-Hoon Kim

TL;DR

We propose a multi-fusion gated recurrent Transformer network (GRTN) that achieves SOTA video denoising performance with only a single-frame delay, together with a residual simplified Swin Transformer with Euclidean distance (RSSTE) that robustly computes attention for noisy features.

Abstract

State-of-the-art (SOTA) video denoising methods employ multi-frame simultaneous denoising mechanisms, resulting in significant delays (e.g., 16 frames), making them impractical for real-time cameras. To overcome this limitation, we propose a multi-fusion gated recurrent Transformer network (GRTN) that achieves SOTA denoising performance with only a single-frame delay. Specifically, the spatial denoising module extracts features from the current frame, while the reset gate selects relevant information from the previous frame and fuses it with current frame features via the temporal denoising module. The update gate then further blends this result with the previous frame features, and the reconstruction module integrates it with the current frame. To robustly compute attention for noisy features, we propose a residual simplified Swin Transformer with Euclidean distance (RSSTE) in the spatial and temporal denoising modules. Comparative objective and subjective results show that our GRTN achieves denoising performance comparable to SOTA multi-frame delay networks, with only a single-frame delay.
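The gating order described in the abstract (reset gate, temporal fusion, update gate) resembles a GRU-style recurrence. The following is a minimal sketch of that data flow on toy feature vectors, not the authors' implementation. The names `recurrent_step` and `run_sequence`, the scalar weights `w_reset` and `w_update`, and the averaging used in place of the temporal denoising module are all assumptions made for illustration. Features are assumed to come from a spatial denoising stage that is not modelled here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recurrent_step(curr_feat, prev_feat, w_reset=1.0, w_update=1.0):
    """Fuse current-frame features with previous-frame features, GRU style.

    curr_feat, prev_feat: equal-length lists of floats (toy feature vectors).
    w_reset, w_update: hypothetical scalar weights standing in for learned layers.
    """
    fused = []
    for c, p in zip(curr_feat, prev_feat):
        r = sigmoid(w_reset * (c + p))    # reset gate: how much previous-frame feature to admit
        candidate = 0.5 * (c + r * p)     # placeholder for the temporal denoising module
        z = sigmoid(w_update * (c + p))   # update gate: blend candidate with previous feature
        fused.append(z * candidate + (1.0 - z) * p)
    return fused

def run_sequence(frames):
    """Process frames in order, carrying only the previous fused features."""
    state = frames[0]                     # the first frame has no history and seeds the state
    outputs = [state]
    for feat in frames[1:]:
        state = recurrent_step(feat, state)
        outputs.append(state)
    return outputs
```

Each step reads only the current frame's features and the state carried over from the previous frame, so no buffer of future frames is needed in this sketch.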

Paper Structure (10 sections, 15 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The proposed GRTN offers denoising performance comparable to SOTA multi-frame delay networks, but with only a single-frame delay. In contrast, VLNB [Arias18:JMIV], DVDNet [Tassano19:ICIP], and FastDVDNet [Tassano20:CVPR] have a 3-frame delay, while PaCNet [Vaksman21:CVPR] and RVRT [Liang22:NIPS] have 4- and 16-frame delays, respectively. The Set8 dataset [Tassano19:ICIP] with Gaussian noise ($\sigma=50$) is used in this evaluation.
  • Figure 2: The detailed network architecture of the proposed GRTN. GDA refers to guided deformable alignment [Liang22:NIPS].
  • Figure 3: Comparison of attention maps using dot product and Euclidean distance. (a) and (d) show a noise-free image (cropped from Lenna) and the same image with Gaussian noise ($\sigma=50$), respectively, with the central $9{\times}9$ patch highlighted in red. (b) and (e) display dot product-based attention maps for the central patch, calculated from (a) and (d). (c) and (f) show the corresponding Euclidean distance-based attention maps.
  • Figure 4: (a) Residual simplified Swin Transformer with Euclidean attention (RSSTE). (b) Simplified Swin Transformer with Euclidean attention (SSTE).
  • Figure 5: Video denoising comparison ($\sigma=50$) on Set8 [Tassano19:ICIP] and DAVIS [Khoreva18:ACCV].
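
The contrast drawn in Figure 3 between dot-product and Euclidean-distance attention can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not the RSSTE formulation: the function names, the `temperature` parameter, and the use of negative squared distance as the score are all choices made here, since the paper's exact scaling is not given in this excerpt. It shows one general property: a dot-product score grows with the norm of the key, so a distant key with large magnitude can outweigh a key identical to the query, while a distance-based score always favors the nearest key.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(query, keys):
    """Attention weights from dot-product similarity."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

def euclidean_attention(query, keys, temperature=1.0):
    """Attention weights from negative squared Euclidean distance (assumed form)."""
    scores = [-sum((q - k) ** 2 for q, k in zip(query, key)) / temperature
              for key in keys]
    return softmax(scores)

query = [1.0, 0.0]
keys = [
    [1.0, 0.0],   # identical to the query
    [3.0, 3.0],   # far from the query, but with a large norm
    [0.0, 1.0],   # orthogonal to the query
]

dot_w = dot_attention(query, keys)
euc_w = euclidean_attention(query, keys)
```

Here the dot-product weights peak at the second key (the large-norm one), whereas the Euclidean weights peak at the first key (the exact match).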