UniSER: A Foundation Model for Unified Soft Effects Removal

Jingdong Zhang, Lingzhi Zhang, Qing Liu, Mang Tik Chiu, Connelly Barnes, Yizhou Wang, Haoran You, Xiaoyang Liu, Yuqian Zhou, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Xin Li, Wenping Wang, Xiaohang Zhan

TL;DR

UniSER tackles the problem of restoring images degraded by soft effects (lens flare, haze, shadows, reflections) by unifying them under a single foundation model. It combines a data-centric strategy with a Diffusion Transformer conditioned on both image context and textual prompts, augmented by a random mask strategy and a continuous strength control to learn robust restoration priors. A 3.8M-pair dataset spanning four degradation domains, together with physically grounded haze synthesis, enables strong generalization and realistic training, and the model achieves state-of-the-art results on public benchmarks and in-the-wild images while preserving scene identity. The approach also supports adding or enhancing effects, zero-shot generalization to unseen degradations, and precise pixel-level editing, offering a practical, controllable tool for photo restoration and downstream applications.

Abstract

Digital images are often degraded by soft effects such as lens flare, haze, shadows, and reflections, which reduce aesthetics even though the underlying pixels remain partially visible. Prevailing works address these degradations in isolation, developing highly specialized models that lack scalability and fail to exploit the shared underlying essence of these restoration problems. Recent large-scale pretrained generalist models offer powerful, text-driven image editing capabilities, but general-purpose systems (e.g., GPT-4o, Flux Kontext, Nano Banana) require detailed prompts and often fail to achieve robust removal on these fine-grained tasks or to preserve the identity of the scene. Leveraging the common essence of soft effects, i.e., semi-transparent occlusions, we introduce UniSER, a versatile foundation model capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on two components: a curated 3.8M-pair dataset that ensures robustness and generalization, including novel, physically plausible data that fills critical gaps in public benchmarks; and a tailored training pipeline that fine-tunes a Diffusion Transformer to learn robust restoration priors from this diverse data, integrating fine-grained mask and strength controls. This synergistic approach allows UniSER to significantly outperform both specialist and generalist models, achieving robust, high-fidelity restoration in the wild.

Paper Structure

This paper contains 29 sections, 6 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Our UniSER eliminates multiple challenging (a) and even undefined (b) soft effects from in-the-wild images while preserving background identities. UniSER also supports precise pixel-level mask control (c) and removal strength control (d), allowing for intuitive and fine-grained restoration tailored to specific user needs. The framework is also capable of adding effects in a given region (e). Masks are global by default if not shown. A demo video is included in the supplementary materials.
  • Figure 2: Visualization of our curated data samples and synthetic haze by our method.
  • Figure 3: The architecture of UniSER. During training, the mask is randomly synthesized along with a scalar strength, and the supervision target is composed from the input image and the original ground truth via the mask and the strength.
  • Figure 4: Comparisons with state-of-the-art specialist and generalist models on in-the-wild testing data. For effect removal, our method significantly outperforms these baselines. Moreover, generalist models fail to preserve the identity of background objects; some of the discrepancies are circled (best viewed zoomed in).
  • Figure 5: (a) Illustration of Strength Control for effect removal. (b) Illustration of Mask Control for accurate regional editing by the user. (c) Adding realistic effects to a clean image, or enhancing existing effects, for flexible editing. (d) Zero-shot generalization to multiple unseen degradations such as rain, stain, etc.
  • ...and 5 more figures
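
The Figure 3 caption says the training target is composed from the input image and the ground truth via a random mask and a scalar strength. One plausible reading is a per-pixel linear blend, sketched below on toy grayscale images stored as nested lists. The blend formula, the box-shaped mask sampler, and the names `random_box_mask` and `compose_target` are all assumptions made for illustration; the source does not give the exact composition or mask distribution.

```python
import random

def random_box_mask(height, width, rng):
    """Hypothetical mask sampler: a single random axis-aligned box of 1.0s."""
    y0 = rng.randrange(height)
    y1 = rng.randrange(y0 + 1, height + 1)
    x0 = rng.randrange(width)
    x1 = rng.randrange(x0 + 1, width + 1)
    return [[1.0 if y0 <= y < y1 and x0 <= x < x1 else 0.0
             for x in range(width)] for y in range(height)]

def compose_target(degraded, clean, mask, strength):
    """Assumed composition: move each pixel from the degraded value toward
    the clean value only where the mask is on, scaled by strength in [0, 1]."""
    return [[d + strength * m * (c - d)
             for d, c, m in zip(d_row, c_row, m_row)]
            for d_row, c_row, m_row in zip(degraded, clean, mask)]

rng = random.Random(0)
degraded = [[0.8] * 4 for _ in range(4)]  # toy "hazy" input
clean = [[0.2] * 4 for _ in range(4)]     # toy ground truth
mask = random_box_mask(4, 4, rng)
target = compose_target(degraded, clean, mask, strength=0.5)
```

Under this assumed blend, a strength of 0 or a mask value of 0 leaves the input pixel unchanged, while a strength of 1 inside the mask reproduces the ground truth, which matches the strength and mask controls described in the captions for Figures 1 and 5.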