LM4LV: A Frozen Large Language Model for Low-level Vision Tasks
Boyang Zheng, Jinjin Gu, Shijun Li, Chao Dong
TL;DR
LM4LV demonstrates that a frozen large language model can process and generate low-level visual features without any multimodal data. By pairing a fine-tuned MAE decoder with two linear adapters and an autoregressive scheme over visual and textual tokens, the approach improves restoration quality (e.g., PSNR gains of up to $+6.81$ dB and $+3.96$ dB on average; SSIM gains of up to $+0.09$) and achieves competitive results on spatial operations. The study identifies a reconstruction-focused vision module (MAE) and autoregressive generation as essential to this success, while acknowledging limitations such as weak recovery of high-frequency detail. Overall, the work provides evidence that frozen LLMs can serve as capable processors of low-level visual features, offering new perspectives on LLM capabilities and potential cross-domain applications.
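To make the dB figures concrete, here is a minimal sketch of the standard PSNR definition, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, and of a PSNR gain as the difference between two PSNR values. The pixel values, the 8-bit range, and the choice of the degraded input as the baseline are assumptions made for illustration; the evaluation protocol actually used in the paper is not stated in this section.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    if not reference:
        raise ValueError("inputs must be non-empty")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical toy data (not from the paper): four pixels in an 8-bit range.
reference = [100.0, 120.0, 140.0, 160.0]
degraded = [110.0, 110.0, 150.0, 150.0]  # every pixel is off by 10
restored = [105.0, 115.0, 145.0, 155.0]  # every pixel is off by 5

# A "PSNR gain" is the difference of two PSNR values measured against the same reference.
gain_db = psnr(reference, restored) - psnr(reference, degraded)
print(f"PSNR gain: {gain_db:.2f} dB")
```

In this toy case, halving the per-pixel error divides the MSE by four, which corresponds to a gain of about 6 dB.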
Abstract
The success of large language models (LLMs) has fostered a new research trend of multi-modality large language models (MLLMs), which is changing the paradigm of various fields in computer vision. Though MLLMs have shown promising results in numerous high-level vision and vision-language tasks such as VQA and text-to-image, no work has demonstrated how low-level vision tasks can benefit from MLLMs. We find that most current MLLMs are blind to low-level features due to the design of their vision modules, and are thus inherently incapable of solving low-level vision tasks. In this work, we propose $\textbf{LM4LV}$, a framework that enables a FROZEN LLM to solve a range of low-level vision tasks without any multi-modal data or prior. This showcases the LLM's strong potential in low-level vision and bridges the gap between MLLMs and low-level vision tasks. We hope this work can inspire new perspectives on LLMs and a deeper understanding of their mechanisms. Code is available at https://github.com/bytetriper/LM4LV.
