LM4LV: A Frozen Large Language Model for Low-level Vision Tasks

Boyang Zheng; Jinjin Gu; Shijun Li; Chao Dong

TL;DR

LM4LV demonstrates that a frozen large language model can process and generate low-level visual features without any multimodal data. By pairing a fine-tuned MAE decoder with two linear adapters and an autoregressive scheme for visual and textual tokens, the approach achieves meaningful improvements on restoration tasks (e.g., PSNR up to $+6.81$ dB, average $+3.96$ dB; SSIM up to $+0.09$) and competitive results on spatial operations. The study emphasizes the importance of a reconstruction-focused vision module (MAE) and autoregressive generation for success, while also acknowledging limitations such as limited high-frequency detail recovery. Overall, the work provides evidence that frozen LLMs can serve as capable processors of low-level visual features, offering new perspectives on LLM capabilities and potential cross-domain applications.
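The PSNR gains quoted above can be made concrete with a small worked example. The sketch below assumes the gain is the PSNR of the restored output minus the PSNR of the degraded input, both measured against the clean image; the summary does not state the exact protocol. The function name `psnr` and the pixel values are illustrative only.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities.
    """
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 4-pixel "images" (illustrative values, not data from the paper).
clean = [100, 120, 140, 160]
degraded = [110, 110, 150, 150]  # every pixel is off by 10, so MSE = 100
restored = [105, 115, 145, 155]  # every pixel is off by 5, so MSE = 25

# Assumed definition of the gain: restored PSNR minus degraded-input PSNR.
gain = psnr(clean, restored) - psnr(clean, degraded)
```

Because PSNR is logarithmic in the mean squared error, halving the per-pixel error here cuts the MSE by a factor of four and yields a gain of $10 \log_{10} 4 \approx 6.02$ dB, which gives a sense of scale for the reported figures.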

Abstract

The success of large language models (LLMs) has fostered a new research trend of multi-modality large language models (MLLMs), which changes the paradigm of various fields in computer vision. Though MLLMs have shown promising results in numerous high-level vision and vision-language tasks such as VQA and text-to-image, no works have demonstrated how low-level vision tasks can benefit from MLLMs. We find that most current MLLMs are blind to low-level features due to the design of their vision modules, and are thus inherently incapable of solving low-level vision tasks. In this work, we propose $\textbf{LM4LV}$, a framework that enables a FROZEN LLM to solve a range of low-level vision tasks without any multi-modal data or prior. This showcases the LLM's strong potential in low-level vision and bridges the gap between MLLMs and low-level vision tasks. We hope this work can inspire new perspectives on LLMs and a deeper understanding of their mechanisms. Code is available at https://github.com/bytetriper/LM4LV.

Paper Structure (30 sections, 7 equations, 15 figures, 6 tables)

Introduction
Related Works
Multi-modal Generation with LLMs
Frozen LLM for Tasks of Other Modalities
Method
Current MLLMs are BLIND to Low-level Features
Enable LLM to See Low-level Features
Next Element Prediction on Low-level Vision Tasks
LLM's Capability on Low-level Tasks
Experiment Setup
LLM Shows Non-trivial Capability on Low-level Vision Tasks
Choice of Vision Module Matters
Auto-regressive Generation Matters
Ablation Studies
Is the Linear Layer Doing the Task?
...and 15 more sections

Figures (15)

Figure 1: Reconstruction results of the vision modules in different MLLMs. Emu2 provides highly semantically consistent images but fails to maintain low-level details, while MAE can reconstruct images with precise low-level details.
Figure 2: Network structure of our design. In the training phase, the visual tokens and the task tokens learn to prompt the LLM to generate the next visual/text tokens. In the inference phase, the LLM generates visual tokens and text tokens in an auto-regressive manner. The visual tokens are then decoded into images.
Figure 3: A frozen LLM shows non-trivial capability on various low-level vision tasks.
Figure 4: All three modules succeed in performing image repetition, but VQGAN and BEiT totally fail for image rotation.
Figure 5: ViT-LLM generation fails for image denoising even when the noise level is low (2nd row), producing low-quality and blurred images.
...and 10 more figures
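
The inference procedure described in the Figure 2 caption (an input adapter, a frozen LLM, an output adapter, and auto-regressive generation of visual tokens) can be illustrated with a toy sketch. This is not the authors' implementation: `frozen_llm_step` is a hypothetical stand-in that simply averages its context, the dimensions are arbitrary, the adapter weights are random rather than trained, and task/text tokens and the MAE decoder are omitted.

```python
import random

def make_linear(d_in, d_out, rng):
    """Weight matrix (d_out rows, d_in columns) for a toy linear adapter."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]

def apply_linear(weights, vec):
    """Matrix-vector product: project vec with the adapter weights."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def frozen_llm_step(context):
    """Hypothetical stand-in for a frozen LLM: mean of the context vectors.

    A real LLM would apply causal attention; the point here is only that
    this function has no trainable state of its own.
    """
    n, dim = len(context), len(context[0])
    return [sum(h[i] for h in context) / n for i in range(dim)]

def generate_visual_tokens(input_tokens, w_in, w_out, num_new):
    """Auto-regressively produce num_new visual tokens from the input tokens."""
    # Project the input visual tokens into the LLM embedding space.
    context = [apply_linear(w_in, tok) for tok in input_tokens]
    generated = []
    for _ in range(num_new):
        hidden = frozen_llm_step(context)        # frozen model predicts next state
        token = apply_linear(w_out, hidden)      # map back to visual feature space
        generated.append(token)
        context.append(apply_linear(w_in, token))  # feed the prediction back in
    return generated

rng = random.Random(0)
D_VIS, D_LLM = 4, 8  # assumed toy dimensions
w_in = make_linear(D_VIS, D_LLM, rng)   # visual features -> LLM space
w_out = make_linear(D_LLM, D_VIS, rng)  # LLM space -> visual features
degraded_tokens = [[rng.random() for _ in range(D_VIS)] for _ in range(3)]
restored_tokens = generate_visual_tokens(degraded_tokens, w_in, w_out, num_new=3)
```

In this sketch only `w_in` and `w_out` would be trainable, which mirrors the idea that the LLM itself stays frozen; the generated tokens would then be passed to an image decoder to obtain the output image.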
