ASteISR: Adapting Single Image Super-resolution Pre-trained Model for Efficient Stereo Image Super-resolution

Yuanbo Zhou, Yuyang Xue, Wei Deng, Xinlin Zhang, Qinquan Gao, Tong Tong

TL;DR

This work addresses the resource burden of large pre-trained SISR transformers and their limited applicability to stereo SR. It proposes ASteISR, which freezes the SISR backbone and trains only two kinds of adapters (stereo and spatial) using parameter-efficient fine-tuning. This yields PSNR gains of up to $0.79\,\mathrm{dB}$ on Flickr1024 and state-of-the-art SteISR performance on four benchmarks while training only about $4.8\%$ of the original model's parameters. Compared with full fine-tuning, training time and memory consumption are reduced by about $57\%$ and $15\%$, respectively. The results indicate that single-image models can be extended to multi-image domains with modest additional parameters, which suggests the approach may carry over to related tasks such as video SR and other low-level vision problems.
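
To put the $0.79\,\mathrm{dB}$ figure in context, the short sketch below uses the standard definition $\mathrm{PSNR} = 10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$ to convert a PSNR gain into the corresponding change in mean squared error. The function names and the 8-bit peak value of 255 are assumptions made for illustration; they are not taken from the paper.

```python
import math


def psnr(mse: float, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error.

    max_value is the peak pixel value (255 assumes 8-bit images).
    """
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# A 0.79 dB gain, independent of the peak value or the starting MSE.
mse_factor = mse_ratio_for_gain(0.79)
print(f"MSE after / MSE before = {mse_factor:.3f}")
```

Under this definition, a gain of $0.79\,\mathrm{dB}$ corresponds to the mean squared error falling to roughly 0.83 of its previous value.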

Abstract

Despite advances in the pre-training then fine-tuning paradigm for low-level vision tasks, significant challenges persist, particularly the memory usage and training time incurred by increasingly large pre-trained models. Another common concern is the unsatisfactory results obtained when pre-trained single-image models are applied directly to multi-image domains. In this paper, we propose an efficient method for transferring a pre-trained single-image super-resolution (SISR) transformer network to the domain of stereo image super-resolution (SteISR) through parameter-efficient fine-tuning (PEFT). Specifically, we introduce stereo adapters and spatial adapters, which are incorporated into the pre-trained SISR transformer network. The pre-trained SISR model is then frozen, so that only the adapters are fine-tuned on stereo datasets. With this training method, we improve the accuracy of the SISR model on stereo images by 0.79 dB on the Flickr1024 dataset. The method trains only 4.8% of the original model parameters while achieving state-of-the-art performance on four commonly used SteISR benchmarks. Compared with the more costly full fine-tuning approach, our method reduces training time and memory consumption by 57% and 15%, respectively.
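
The freeze-then-adapt recipe described in the abstract can be illustrated with a minimal, framework-free sketch. Everything here is hypothetical: the `Param` record, the parameter names, and the sizes are invented for illustration, with the sizes chosen only so that the ratio reproduces the 4.8% quoted above. The sketch also assumes the percentage is measured relative to the frozen backbone. It is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Param:
    name: str
    size: int               # number of scalar weights in this tensor
    trainable: bool = True


def freeze_backbone(params, adapter_tag="adapter"):
    """Mark only parameters whose name contains adapter_tag as trainable."""
    for p in params:
        p.trainable = adapter_tag in p.name
    return params


def trainable_fraction(params, adapter_tag="adapter"):
    """Trainable weights divided by the original (non-adapter) weights."""
    backbone = sum(p.size for p in params if adapter_tag not in p.name)
    trainable = sum(p.size for p in params if p.trainable)
    return trainable / backbone


# Hypothetical sizes: a 1,000,000-weight backbone plus 48,000 adapter weights.
params = [
    Param("backbone.attention.weight", 600_000),
    Param("backbone.mlp.weight", 400_000),
    Param("stereo_adapter.weight", 30_000),
    Param("spatial_adapter.weight", 18_000),
]
freeze_backbone(params)
fraction = trainable_fraction(params)
print(f"trainable / original = {fraction:.1%}")
```

In a real training loop, only the parameters still marked trainable would be handed to the optimizer, so gradients and optimizer state are kept for the adapters alone. That is the usual source of the time and memory savings in parameter-efficient fine-tuning.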

Paper Structure (15 sections, 5 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Common training strategies in low-level vision: (a) training from scratch, (b) full fine-tuning (pre-training then fine-tuning), and (c) parameter-efficient fine-tuning (ours). (d) $\times$2 SteISR results on the Flickr1024 dataset; (e) comparison of tunable parameters, training time, and memory.
  • Figure 2: The ASteISR framework, which achieves SteISR by adding spatial and stereo adapters to the pre-trained SISR model HAT [chen2023activating].
  • Figure 3: The main idea behind the stereo adapter. (a) Single image super-resolution (SISR) and stereo image super-resolution (SteISR) differ in the knowledge they use for restoration and reconstruction: SISR relies solely on intra-image knowledge, while SteISR leverages both intra-image and cross-image knowledge. (b) The stereo adapter, which integrates information from the left and right views.
  • Figure 4: By incorporating a spatial adapter (a) into the HAB (c), the SISR model can efficiently handle stereo images. The main difference between the HAB and the STL is that the HAB includes a Channel Attention Block (CAB).
  • Figure 5: Visual results ($\times$4) achieved by different methods on the Flickr1024 dataset [wang2019learning].
  • ...and 7 more figures