LM4LV: A Frozen Large Language Model for Low-level Vision Tasks

Boyang Zheng, Jinjin Gu, Shijun Li, Chao Dong

TL;DR

LM4LV demonstrates that a frozen large language model can process and generate low-level visual features without any multimodal data. By pairing a fine-tuned MAE decoder with two linear adapters and an autoregressive scheme for visual and textual tokens, the approach achieves meaningful improvements on restoration tasks (e.g., PSNR up to $+6.81$ dB, average $+3.96$ dB; SSIM up to $+0.09$) and competitive results on spatial operations. The study emphasizes the importance of a reconstruction-focused vision module (MAE) and autoregressive generation for success, while also acknowledging limitations such as limited high-frequency detail recovery. Overall, the work provides evidence that frozen LLMs can serve as capable processors of low-level visual features, offering new perspectives on LLM capabilities and potential cross-domain applications.
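For readers less familiar with the metric, the PSNR gains quoted above can be related to pixel error with a small sketch. This is a generic illustration of the standard PSNR definition, not the paper's evaluation code; the function names (`mse`, `psnr`, `mse_ratio_for_gain`) and the default 8-bit peak value of 255 are assumptions made for the example.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE).

    max_value=255.0 assumes 8-bit pixel values (an assumption for this sketch).
    """
    err = mse(reference, estimate)
    if err == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / err)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    clean = [10.0, 20.0, 30.0, 40.0]
    degraded = [14.0, 16.0, 34.0, 36.0]   # toy "degraded" pixels
    restored = [11.0, 19.0, 31.0, 39.0]   # toy "restored" pixels
    print(psnr(clean, degraded), psnr(clean, restored))
    print(mse_ratio_for_gain(6.81))
```

Because PSNR is logarithmic in the mean squared error, a gain of $+6.81$ dB corresponds to the MSE shrinking by a factor of $10^{0.681}$, or roughly 4.8.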

Abstract

The success of large language models (LLMs) has fostered a new research trend of multi-modality large language models (MLLMs), which is changing the paradigm of various fields in computer vision. Though MLLMs have shown promising results in numerous high-level vision and vision-language tasks such as VQA and text-to-image, no work has demonstrated how low-level vision tasks can benefit from MLLMs. We find that most current MLLMs are blind to low-level features due to the design of their vision modules, and are thus inherently incapable of solving low-level vision tasks. In this work, we propose $\textbf{LM4LV}$, a framework that enables a FROZEN LLM to solve a range of low-level vision tasks without any multi-modal data or prior. This showcases the LLM's strong potential in low-level vision and bridges the gap between MLLMs and low-level vision tasks. We hope this work can inspire new perspectives on LLMs and a deeper understanding of their mechanisms. Code is available at https://github.com/bytetriper/LM4LV.

Paper Structure (30 sections, 7 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: Reconstruction results of the vision modules in different MLLMs. Emu2 produces images that are highly consistent semantically but fails to maintain low-level details, while MAE can reconstruct images with precise low-level details.
  • Figure 2: Network structure of our design. In the training phase, the visual tokens and the task tokens learn to prompt the LLM to generate the next visual/text tokens. In the inference phase, the LLM generates visual tokens and text tokens in an auto-regressive manner. The visual tokens are then decoded into images.
  • Figure 3: A frozen LLM shows non-trivial capability on various low-level vision tasks.
  • Figure 4: All three modules succeed in performing image repetition, but VQGAN and BEiT fail completely at image rotation.
  • Figure 5: ViT-LLM generation fails for image denoising even when the noise level is low (2nd row), producing low-quality and blurred images.
  • ...and 10 more figures