Efficient Hybrid Zoom using Camera Fusion on Mobile Phones

Xiaotong Wu; Wei-Sheng Lai; YiChang Shih; Charles Herrmann; Michael Krainin; Deqing Sun; Chia-Kai Liang

TL;DR

The paper tackles high-quality zoom on mobile devices by capturing a synchronized Wide (W) and Telephoto (T) pair and running it through a hybrid zoom super-resolution pipeline. It proposes efficient on-device alignment, a Fusion UNet for detail transfer, and a multi-map adaptive blending strategy that handles depth-of-field (DoF) mismatch, occlusion, and alignment errors. Training on real captures from a dual-phone camera rig, together with the new HZSR dataset, reduces the domain gap between training and real-world inputs. The system produces a 12-megapixel output in about 500 ms on a mobile platform and compares favorably with existing RefSR methods on public benchmarks and on the HZSR dataset. The work combines hardware-aware optimization with learning-based fusion to make hybrid zoom practical on consumer devices.

Abstract

DSLR cameras can achieve multiple zoom levels via shifting lens distances or swapping lens types. However, these techniques are not possible on smartphone devices due to space constraints. Most smartphone manufacturers adopt a hybrid zoom system: commonly a Wide (W) camera at a low zoom level and a Telephoto (T) camera at a high zoom level. To simulate zoom levels between W and T, these systems crop and digitally upsample images from W, leading to significant detail loss. In this paper, we propose an efficient system for hybrid zoom super-resolution on mobile devices, which captures a synchronous pair of W and T shots and leverages machine learning models to align and transfer details from T to W. We further develop an adaptive blending method that accounts for depth-of-field mismatches, scene occlusion, flow uncertainty, and alignment errors. To minimize the domain gap, we design a dual-phone camera rig to capture real-world inputs and ground-truths for supervised training. Our method generates a 12-megapixel image in 500ms on a mobile platform and compares favorably against state-of-the-art methods under extensive evaluation on real-world scenarios.
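
The detail loss from cropping and upsampling W can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the 4000 x 3000 frame size (one possible 12-megapixel layout) are assumptions, not values from the paper. It computes the centered crop that a digital zoom factor keeps and the fraction of captured pixels that survive.

```python
def center_crop_box(width, height, zoom):
    """Return (left, top, right, bottom) of the centered crop kept by a
    digital zoom factor `zoom`, measured relative to the full frame."""
    if zoom < 1:
        raise ValueError("zoom must be >= 1")
    crop_w = round(width / zoom)
    crop_h = round(height / zoom)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


def retained_pixel_fraction(zoom):
    """Fraction of the captured pixels that remain after the crop."""
    return 1.0 / (zoom * zoom)


# Hypothetical 12-megapixel W frame (assumed size, not from the paper).
box = center_crop_box(4000, 3000, 3.0)
fraction = retained_pixel_fraction(3.0)
print(box, fraction)
```

Under these assumptions, a 3x digital zoom keeps roughly one ninth of the W pixels, and the remainder of the output resolution has to come from upsampling. That gap is what transferring detail from T is meant to close.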

Paper Structure (42 sections, 10 equations, 21 figures, 3 tables)


Introduction
Efficient processing on mobile devices
Adapting to imperfect references
Minimizing domain gap with real-world inputs
Related Work
Learning-based SISR
RefSR using Internet images
RefSR using auxiliary cameras
Efficient mobile RefSR
Hybrid Zoom Super-Resolution
Image Alignment
Coarse alignment
Dense alignment
Image Fusion
Adaptive Blending
...and 27 more sections

Figures (21)

Figure 1: Detail improvements in hybrid zoom. The red dotted lines mark the FOV of $3\times$ zoom on $1\times$ wide (W) camera, while the green dotted lines mark the FOV of $5\times$ telephoto (T) camera. Image quality at an intermediate zoom range suffers from blurry details from single-image super-resolution (RAISR; Romano et al., 2016). Our mobile hybrid zoom super-resolution (HZSR) system captures a synchronous pair of W and T and fuses details through efficient ML models and adaptive blending. Our fusion results significantly improve texture clarity when compared to the upsampled W.
Figure 2: When depth-of-field (DoF) is shallower on telephoto (T) than wide (W), transferring details from T to W in defocus regions results in significant artifacts. We design our system to exclude defocus regions during fusion and generate results that are robust to lens DoF. By contrast, the result from DCSR (Wang et al., 2021) shows blurrier details than the input W on the parrot and building.
Figure 3: System overview. Given concurrently captured W and T images, we crop W to match the FOV of T, coarsely align them via feature matching, and adjust the color of T to match W. The cropped W and adjusted T are referred to as source and reference, respectively. Then, we estimate dense optical flow to align the reference to source (see Image Alignment) and generate an occlusion mask. Our Fusion UNet takes as input the source, warped reference, and occlusion mask for detail fusion (see Image Fusion). Lastly, we merge the fusion result back to the full W image via an adaptive blending (see Adaptive Blending, Fig. 4) as the final output.
Figure 4: Adaptive blending. We use alpha masks to make the fusion robust to alignment errors and DoF mismatch (see Adaptive Blending).
Figure 5: Efficient defocus map detection using optical flow at the alignment stage, described in the Adaptive Blending section. Black/white pixels in the defocus map represent the focused/defocused area.
...and 16 more figures
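
The alpha-mask blending named in the Figure 4 caption can be illustrated with a minimal sketch. The captions do not give the paper's exact formula, so the rule used here is an assumption chosen for illustration: take the per-pixel maximum of the rejection maps and invert it. The function names and the toy values are hypothetical as well.

```python
def blend_alpha(occlusion, defocus, flow_uncertainty, alignment_error):
    """Combine per-pixel rejection scores into one alpha value.

    Each input is in [0, 1], where 1 means "do not trust the fused pixel".
    ASSUMPTION: the max-then-invert rule is illustrative, not the paper's.
    Returns alpha in [0, 1], where 1 means "use the fused pixel fully".
    """
    reject = max(occlusion, defocus, flow_uncertainty, alignment_error)
    return 1.0 - min(1.0, max(0.0, reject))


def adaptive_blend(source, fused, alpha):
    """Per-pixel linear blend: alpha * fused + (1 - alpha) * source.

    All three arguments are equally sized 2D lists of floats.
    """
    return [
        [a * f + (1.0 - a) * s for s, f, a in zip(s_row, f_row, a_row)]
        for s_row, f_row, a_row in zip(source, fused, alpha)
    ]


# Toy 1 x 2 image: the first pixel is trusted, the second is defocused in T.
source = [[0.2, 0.2]]
fused = [[0.8, 0.8]]
alpha = [[blend_alpha(0.0, 0.0, 0.0, 0.0), blend_alpha(0.0, 1.0, 0.0, 0.0)]]
print(adaptive_blend(source, fused, alpha))
```

With this rule, any single map can veto the fused result at a pixel, so the output falls back to the source W wherever T is occluded, defocused, or poorly aligned.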
