Learning Accurate and Enriched Features for Stereo Image Super-Resolution

Hu Gao, Depeng Dang

TL;DR

A mixed-scale selective fusion network (MSSFNet) is proposed to preserve precise spatial details, incorporate abundant contextual information, and adaptively select and fuse the most accurate features from the two views, promoting high-quality stereoSR.

Abstract

Stereo image super-resolution (stereoSR) aims to enhance the quality of super-resolution results by incorporating complementary information from the other view. Although current methods have made significant advances, they typically operate on full-resolution representations to preserve spatial details, which makes it difficult to capture contextual information accurately. At the same time, they use all feature similarities to cross-fuse information from the two views, potentially ignoring the impact of irrelevant information. To overcome these problems, we propose a mixed-scale selective fusion network (MSSFNet) that preserves precise spatial details, incorporates abundant contextual information, and adaptively selects and fuses the most accurate features from the two views to promote high-quality stereoSR. Specifically, we develop a mixed-scale block (MSB) that obtains contextually enriched feature representations across multiple spatial scales while preserving precise spatial details. Furthermore, to dynamically retain the most essential cross-view information, we design a selective fusion attention module (SFAM) that searches for and transfers the most accurate features from the other view. To learn an enriched set of local and non-local features, we introduce a fast Fourier convolution block (FFCB) that explicitly integrates frequency-domain knowledge. Extensive experiments show that MSSFNet achieves significant improvements over state-of-the-art approaches in both quantitative and qualitative evaluations.
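The contrast drawn in the abstract, fusing with all cross-view similarities versus keeping only the best matches, can be illustrated with a small NumPy sketch of top-k attention masking. This is not the authors' SFAM. It assumes rectified stereo features of shape (H, W, C), compares each left-view pixel only with right-view pixels in the same row (the epipolar line), and uses a hypothetical `top_k` cutoff before the softmax; the function name `row_cross_attention` is likewise an illustrative assumption. With `top_k=None` it reduces to dense attention over every similarity, the baseline behavior the abstract argues against.

```python
import numpy as np


def row_cross_attention(feat_l, feat_r, top_k=None):
    """Attend from left-view features to right-view features along each row.

    Hypothetical illustration, not the SFAM implementation from the paper.
    feat_l, feat_r: (H, W, C) feature maps of a rectified stereo pair.
    top_k: None -> dense attention over all W candidates in the row;
           int  -> keep only the top_k most similar candidates per query
                   (ties at the threshold may keep a few more).
    Returns (fused, weights) with shapes (H, W, C) and (H, W, W).
    """
    _, w, c = feat_l.shape
    # Scaled dot-product similarity between left pixel i and right pixel j in row h
    scores = np.einsum("hic,hjc->hij", feat_l, feat_r) / np.sqrt(c)
    if top_k is not None and top_k < w:
        # The k-th largest score of each query acts as a per-query threshold
        kth = np.sort(scores, axis=-1)[..., -top_k][..., None]
        scores = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax; masked entries get exactly zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of right-view features, aligned to left-view positions
    fused = np.einsum("hij,hjc->hic", weights, feat_r)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.standard_normal((4, 8, 16))
    right = rng.standard_normal((4, 8, 16))
    _, dense_w = row_cross_attention(left, right)
    _, sparse_w = row_cross_attention(left, right, top_k=2)
    print("nonzero weights per query (dense):", (dense_w > 0).sum(axis=-1).max())
    print("nonzero weights per query (top-2):", (sparse_w > 0).sum(axis=-1).max())
```

Masking before normalization means weakly matching candidates (for example, pixels with no true correspondence in the other view) receive exactly zero weight instead of a small share of it. The value of k here is arbitrary and only serves the illustration.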

Paper Structure (14 sections, 13 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Comparison between our method and existing methods. Existing methods use all attention relations with the other view to aggregate features. Our method searches for and transfers only the most accurate features from the other view, reducing distraction from irrelevant information.
  • Figure 2: Computational cost vs. PSNR of our MSSFNet and other state-of-the-art algorithms for 4× stereo SR on Flickr1024 [flickr1024wang2019learning]. (a) Our MSSFNet achieves state-of-the-art performance with fewer FLOPs. (b) Total number of parameters vs. PSNR: our MSSFNet achieves the best performance with up to 87% fewer parameters.
  • Figure 3: The overall architecture of MSSFNet with three key components: (1) the mixed-scale block (MSB) (illustrated in Figure 4(a)), (2) the fast Fourier convolution block (FFCB) (depicted in Figure 4(c)), and (3) the selective fusion attention module (SFAM) (depicted in Figure 4(d)).
  • Figure 4: (a) Mixed-scale block (MSB). (b) Simplified channel attention (SCA). (c) Fast Fourier convolution block (FFCB). (d) Selective fusion attention module (SFAM).
  • Figure 5: Visual results (×2) achieved by different methods on the Flickr1024 dataset [flickr1024wang2019learning].
  • ...and 3 more figures
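Figure 4(c) only names the fast Fourier convolution block. As a rough sketch, assuming FFCB builds on the spectral "Fourier unit" of the original fast Fourier convolution (FFC) design, the NumPy code below applies a real 2-D FFT, mixes the stacked real and imaginary parts with a 1×1 convolution, applies ReLU, and transforms back. The `weight` matrix, the function name `fourier_unit`, and the omission of batch normalization and of any local branch are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def fourier_unit(x, weight, use_relu=True):
    """Spectral branch in the style of the original fast Fourier convolution (FFC).

    Assumed stand-in for the global part of FFCB, not the paper's code.
    x: real feature map of shape (C, H, W).
    weight: (2C, 2C) matrix, a hypothetical 1x1 convolution applied to the
            stacked real and imaginary parts of the spectrum.
    """
    c, h, w = x.shape
    # Real 2-D FFT over the spatial axes: (C, H, W // 2 + 1), complex
    spec = np.fft.rfft2(x, axes=(-2, -1))
    # Treat real and imaginary parts as separate channels: (2C, H, W // 2 + 1)
    stacked = np.concatenate([spec.real, spec.imag], axis=0)
    # 1x1 convolution = the same channel-mixing matrix at every frequency
    mixed = np.einsum("oc,chw->ohw", weight, stacked)
    if use_relu:
        # Nonlinearity on spectral coefficients (FFC's batch norm is omitted)
        mixed = np.maximum(mixed, 0.0)
    spec_out = mixed[:c] + 1j * mixed[c:]
    # Back to the spatial domain at the original spatial size
    return np.fft.irfft2(spec_out, s=(h, w), axes=(-2, -1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((4, 16, 16))
    out = fourier_unit(feat, rng.standard_normal((8, 8)) * 0.1)
    print(out.shape)
```

Every frequency coefficient is a function of the whole feature map, so applying the nonlinearity in the frequency domain lets each output pixel depend on image-wide content. FFC pairs such a spectral branch with an ordinary local convolution branch, which matches the abstract's stated goal of learning both local and non-local features. With `use_relu=False` and an identity `weight`, the unit reduces to an exact FFT round trip, a convenient sanity check.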