Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models

Purui Bai, Junxian Duan, Pin Wang, Jinhua Hao, Ming Sun, Chao Zhou, Huaibo Huang

Abstract

Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.
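
The reward-guided inference loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the two-mode velocity field, the reward function, the noise schedule, and every name below (`velocity`, `reward`, `lookahead_estimate`, `sample`) are hypothetical stand-ins chosen so the example runs with NumPy alone. The look-ahead is a generic few-step Euler estimate, used here only to show why scoring a partially denoised candidate can help; it should not be read as the paper's MSPDE. At each step, several perturbed candidates are scored by the reward of their estimated final sample, and the best one is kept greedily.

```python
import numpy as np

# Hypothetical 2-D "images": one high-quality mode and one low-quality mode.
MODE_GOOD = np.array([1.0, 1.0])
MODE_BAD = np.array([-1.0, -1.0])


def nearest_mode(x):
    """Return whichever toy mode is closer to x."""
    d_good = np.sum((x - MODE_GOOD) ** 2)
    d_bad = np.sum((x - MODE_BAD) ** 2)
    return MODE_GOOD if d_good <= d_bad else MODE_BAD


def velocity(x, t):
    """Toy stand-in for a learned flow matching velocity field.

    On the straight path x_t = (1 - t) * x_0 + t * x_1 the velocity is
    (x_1 - x_t) / (1 - t); here x_1 is simply the nearest toy mode.
    """
    return (nearest_mode(x) - x) / (1.0 - t)


def reward(x):
    """Assumed reward model: higher when x is closer to the good mode."""
    return -float(np.sum((x - MODE_GOOD) ** 2))


def lookahead_estimate(x, t, dt, n_lookahead):
    """Take a few Euler steps, then extrapolate to t = 1 in one jump."""
    for _ in range(n_lookahead):
        if t + dt >= 1.0 - 1e-9:
            break
        x = x + dt * velocity(x, t)
        t = t + dt
    return x + (1.0 - t) * velocity(x, t)


def sample(x0, n_steps, n_candidates, sigma, rng):
    """Euler sampling with greedy, reward-guided candidate selection."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        # Candidate 0 is the unperturbed state; the others add noise
        # whose scale shrinks as t approaches 1.
        candidates = [x] + [
            x + sigma * (1.0 - t) * rng.standard_normal(x.shape)
            for _ in range(n_candidates - 1)
        ]
        scores = [reward(lookahead_estimate(c, t, dt, 2)) for c in candidates]
        x = candidates[int(np.argmax(scores))]  # greedy path selection
        x = x + dt * velocity(x, t)             # ordinary Euler update
    return x


rng = np.random.default_rng(0)
x0 = np.array([-0.2, -0.1])  # starts slightly on the low-quality side
baseline = sample(x0, n_steps=10, n_candidates=1, sigma=0.5, rng=rng)
guided = sample(x0, n_steps=10, n_candidates=8, sigma=0.5, rng=rng)
print("baseline reward:", reward(baseline))
print("guided reward:  ", reward(guided))
```

With `n_candidates=1` the loop reduces to plain Euler sampling, so the extra cost of the guided variant grows with the number of candidates and look-ahead steps. This is one simple way to read the "controllable computational overhead" mentioned in the abstract.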

Paper Structure (23 sections, 5 equations, 10 figures, 4 tables, 1 algorithm)

Figures (10)

  • Figure 1: ResFlow-Tuner delivers superior performance on both synthetic (the first row) and real-world (the second row) benchmarks, excelling in terms of perceptual quality and objective image quality assessment.
  • Figure 2: Architecture of the proposed ResFlow-Tuner. ResFlow-Tuner enhances training performance through the seamless integration of multi-modal guidance. During inference, it adopts a greedy optimization strategy for path selection, augmented by our Multi-Step Partial Denoising Estimator (MSPDE) for more accurate path evaluation.
  • Figure 3: Qualitative comparisons on both synthetic (the first row) and real-world (the last three rows) benchmarks. Please zoom in for a better view.
  • Figure 4: User Study Results. (a) Average ranking of the six methods across all participants and test images, with error bars. (b) Top-K ratios (K=1,2,3,4,5) demonstrating our method's consistency in producing high-quality results across diverse image content.
  • Figure 5: Visual comparisons for ablation study on ResFlow-Tuner (1/2).
  • ...and 5 more figures