LLMRA: Multi-modal Large Language Model based Restoration Assistant

Xiaoyu Jin, Yuan Shi, Bin Xia, Wenming Yang

TL;DR

The paper addresses applying multimodal large language models (MLLMs) to low-level vision by introducing LLMRA, a system that uses degradation priors generated by an MLLM to guide a restoration network. An MLLM produces a text description of the degradation, a CLIP text encoder turns that description into a feature, and the feature is refined by a Context Enhance Module and injected into a Degradation Context based Transformer (DC-former) through Degradation Modulation Modules. The method reports state-of-the-art performance across unified restoration tasks (denoising, deraining, and low-light enhancement) and supports interactive, text-driven restoration via dialogue. Limitations include reliance on the quality of MLLM outputs and the current focus on three degradation types, which suggests future work on broader degradation coverage and robustness.

Abstract

Multi-modal Large Language Models (MLLMs) have a significant impact on various tasks, due to their extensive knowledge and powerful perception and generation capabilities. However, applying MLLMs to low-level vision tasks remains an open research problem. In this paper, we present a simple MLLM-based image restoration framework to address this gap, namely Multi-modal Large Language Model based Restoration Assistant (LLMRA). We exploit the impressive capabilities of MLLMs to obtain the degradation information for universal image restoration. By employing a pretrained multi-modal large language model and a vision language model, we generate text descriptions and encode them as context embeddings carrying degradation information for the degraded image. Through the proposed Context Enhance Module (CEM) and Degradation Context based Transformer Network (DC-former), we integrate these context embeddings into the restoration network, contributing to more accurate and adjustable image restoration. Based on the dialogue with users, our method leverages image degradation priors from MLLMs, providing low-level attribute descriptions of the input low-quality images and the restored high-quality images simultaneously. Extensive experiments demonstrate the superior performance of our LLMRA in universal image restoration tasks.
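
The data flow described in the abstract (text description, then context embedding, then modulation of restoration features) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `toy_text_embedding` is a hypothetical hash-based stand-in for a real vision-language text encoder, and `modulate` uses a generic scale-and-shift (FiLM-style) conditioning, which may differ from the paper's Degradation Modulation Module.

```python
import hashlib
import math
from typing import List


def toy_text_embedding(text: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for a text encoder: hash tokens into a unit-length vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero on empty text
    return [v / norm for v in vec]


def modulate(
    features: List[List[float]],
    context: List[float],
    w_scale: List[List[float]],
    w_shift: List[List[float]],
) -> List[List[float]]:
    """Assumed scale-and-shift conditioning: y = x * (1 + gamma_c) + beta_c per channel c.

    gamma_c and beta_c are linear projections of the context embedding;
    features[c] holds the flattened values of channel c.
    """
    out = []
    for c, channel in enumerate(features):
        gamma = sum(w * e for w, e in zip(w_scale[c], context))
        beta = sum(w * e for w, e in zip(w_shift[c], context))
        out.append([x * (1.0 + gamma) + beta for x in channel])
    return out


# Usage: a degradation description conditions a tiny two-channel feature map.
context = toy_text_embedding("the image is dark and contains noise")
features = [[0.1, 0.2], [0.3, 0.4]]
zeros = [[0.0] * len(context) for _ in features]
identity = modulate(features, context, zeros, zeros)  # zero weights leave features unchanged
```

With zero projection weights the modulation reduces to the identity, so the text only affects the features through the projections, which would be learned in a real network. Changing the description changes `context`, which is one plausible way text-driven adjustment of the restoration could be realized.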

Paper Structure (13 sections, 10 equations, 5 figures, 6 tables)

This paper contains 13 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Example of the proposed LLMRA for universal image restoration. Given the input image and a text prompt asking about the low-level attributes of the image, our method provides the corresponding descriptions. Upon the <restore> instruction, LLMRA automatically uses the degradation descriptions from the MLLM to restore the image. When instructed with the <refine> command, LLMRA instead performs image restoration based on the content of the dialogue.
  • Figure 2: The overview of the proposed LLMRA. (a) The proposed LLMRA Framework. DEN, CT and DC-former are used to refine and incorporate the degradation information into the restoration network. (b) Context Enhance Module (CEM). (c) Context Transformer (CT).
  • Figure 3: Degradation Modulation Module (DMM) in DC-former.
  • Figure 4: Visual comparisons with the SOTA methods. Rows 1-2, 3-4, and 5-6 display the results of image denoising, image deraining, and low-light image enhancement, respectively. The test images are from Urban100, Rain100L, and LOLv1. Zoom in for better visualization.
  • Figure 5: Impact of the text input.