MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration

Lingshun Kong; Jiawei Zhang; Zhengpeng Duan; Xiaohe Wu; Yueqi Yang; Xiaotao Wang; Dongqing Zou; Lei Lei; Jinshan Pan

TL;DR

A unified image restoration framework that integrates a dual-level Mixture-of-Experts architecture with a pretrained diffusion model, enabling coarse-grained adaptation across diverse degradation categories and fine-grained modulation for specific intra-class variations.

Abstract

All-in-one image restoration is challenging because different degradation types, such as haze, blur, noise, and low-light, impose diverse requirements on restoration strategies, making it difficult for a single model to handle them effectively. In this paper, we propose a unified image restoration framework that integrates a dual-level Mixture-of-Experts (MoE) architecture with a pretrained diffusion model. The framework operates at two levels: the Inter-MoE layer adaptively combines expert groups to handle major degradation types, while the Intra-MoE layer further selects specialized sub-experts to address fine-grained variations within each type. This design enables the model to achieve coarse-grained adaptation across diverse degradation categories while performing fine-grained modulation for specific intra-class variations, ensuring high specialization in handling complex, real-world corruptions. Extensive experiments demonstrate that the proposed method performs favorably against state-of-the-art approaches on multiple image restoration tasks.
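
The two routing levels described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain linear maps as stand-ins for the attention-based experts, a softmax router that weights every expert group (dense routing at the Inter-MoE level), and a top-k router inside each group (sparse routing at the Intra-MoE level). All names (`mim_forward`, `top_k_sparse`, `rand_matrix`) and sizes are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def top_k_sparse(scores, k):
    """Sparse gates: softmax over the k largest scores, zero elsewhere."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in keep])
    gates = [0.0] * len(scores)
    for i, p in zip(keep, probs):
        gates[i] = p
    return gates


def mim_forward(x, inter_router, groups, k=1):
    """Dense routing over groups, sparse routing over sub-experts in each group."""
    # Inter level (assumed dense): every group receives a non-zero weight.
    group_weights = softmax(linear(inter_router, x))
    out = [0.0] * len(x)
    for g_w, group in zip(group_weights, groups):
        # Intra level (assumed top-k): only k sub-experts per group are evaluated.
        gates = top_k_sparse(linear(group["router"], x), k)
        for gate, expert in zip(gates, group["experts"]):
            if gate == 0.0:
                continue  # unselected sub-expert, skipped entirely
            y = linear(expert, x)  # placeholder expert: a linear map
            out = [o + g_w * gate * yi for o, yi in zip(out, y)]
    return out, group_weights


def rand_matrix(rows, cols, rng):
    """Random weights, for illustration only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, n_groups, n_sub = 4, 4, 3  # illustrative sizes, not taken from the paper
inter_router = rand_matrix(n_groups, dim, rng)
groups = [
    {
        "router": rand_matrix(n_sub, dim, rng),
        "experts": [rand_matrix(dim, dim, rng) for _ in range(n_sub)],
    }
    for _ in range(n_groups)
]
x = [0.5, -1.0, 0.25, 2.0]
out, group_weights = mim_forward(x, inter_router, groups, k=1)
```

In this sketch the output is a weighted sum in which each group contributes through its selected sub-experts only, so the cost per group stays at `k` expert evaluations while all groups still influence the result.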

Paper Structure

This paper contains 11 sections, 5 equations, 6 figures, and 6 tables.

Introduction
Related Work
Preliminaries
Proposed Method
Hierarchical MoE in MoE
Conditional Generation for DiT with MiM
Experiments
Experimental Settings
Comparisons with Other Methods
Analysis and Discussion
Conclusion

Figures (6)

Figure 1: Homogeneous MoE vs. the proposed MiM. The symbols in the figure are consistent with those defined in our method section and share the same meanings. Our hierarchical MiM adopts a dual-level routing mechanism, which ensures dynamic and specialized processing by routing inputs to appropriate architectural priors and fine-grained specialists. Compared with homogeneous MoE, our method enables better restoration for blur, haze, and low-light conditions through adaptive structural selection.
Figure 2: Overview of the proposed Hierarchical MoE in MoE (MiM) integrated into the DiT backbone. The framework processes low-quality (LQ) images through a series of MiM-DiT blocks. Given the LQ input, the MiM module extracts degradation-specific features and processes them through a hierarchical MoE architecture composed of two levels: Inter-MoE and Intra-MoE. At the Inter-MoE level, four expert groups based on distinct attention mechanisms (spatial self-attention, channel self-attention, Swin attention, and SE attention) are combined via a dense router that computes adaptive weights over all groups. This dense fusion enables the model to leverage complementary inductive biases. Within each expert group, Intra-MoE captures fine-grained variations within each degradation category via sparse routing. These processed features are then injected as conditional input into the DiT backbone through a Zero-Linear pathway, dynamically guiding the diffusion process to generate restoration results.
Figure 3: Deblurred results on the FoundIR dataset. The deblurred results in (c)-(g) still contain significant blur effects. In contrast, our method generates clear results.
Figure 4: Dehazed results on the FoundIR dataset. The results in (c)-(g) fail to fully restore the original scene content while removing haze. In contrast, our method generates clear and faithful reconstructions.
Figure 5: Low-light enhanced results on the FoundIR dataset. The results in (c)-(g) suffer from color casts and detail smearing. In contrast, our method recovers accurate colors and fine structures.
...and 1 more figure
