Frozen Transformers in Language Models Are Effective Visual Encoder Layers

Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang

TL;DR

The paper demonstrates that a frozen transformer block from a pre-trained LLM can serve as an effective, general-purpose visual encoder layer across a wide range of tasks, including 2D/3D recognition, video understanding, motion forecasting, and vision-language tasks, without language prompts or joint pretraining. The frozen LLM block is inserted between the visual encoder and the decoder, with two trainable linear layers added to align the feature dimensions. This yields consistent improvements across diverse architectures and tasks, suggesting that transformer blocks trained only on text transfer to visual tokens. The authors propose the information filtering hypothesis to explain this phenomenon, presenting qualitative and quantitative evidence that the frozen LLM block sharpens the focus on informative visual tokens and amplifies their downstream effect. These findings challenge conventional VLM design, offer a modular way to leverage LLMs for vision, and invite further study of how LLMs process tokens from other modalities.

Abstract

This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a simple yet previously overlooked strategy -- employing a frozen transformer block from pre-trained LLMs as a constituent encoder layer to directly process visual tokens. Our work pushes the boundaries of leveraging LLMs for computer vision tasks, significantly departing from conventional practices that typically necessitate a multi-modal vision-language setup with associated language prompts, inputs, or outputs. We demonstrate that our approach consistently enhances performance across a diverse range of tasks, encompassing pure 2D and 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and different LLM transformer blocks. We additionally propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding -- the pre-trained LLM transformer blocks discern informative visual tokens and further amplify their effect. This hypothesis is empirically supported by the observation that the feature activation, after training with LLM transformer blocks, exhibits a stronger focus on relevant regions. We hope that our work inspires new perspectives on utilizing LLMs and deepening our understanding of their underlying mechanisms. Code is available at https://github.com/ziqipang/LM4VisualEncoding.
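
The strategy described above (visual tokens passed through a trainable linear layer, a frozen transformer block, and a second trainable linear layer) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `FrozenBlockEncoder`, the dimensions, and the use of a randomly initialised `nn.TransformerEncoderLayer` as a stand-in for a pre-trained LLM block are all hypothetical choices made so the example runs without downloading weights.

```python
import torch
import torch.nn as nn


class FrozenBlockEncoder(nn.Module):
    """Hypothetical wrapper: linear -> frozen transformer block -> linear."""

    def __init__(self, vis_dim: int, llm_dim: int, frozen_block: nn.Module):
        super().__init__()
        self.proj_in = nn.Linear(vis_dim, llm_dim)   # trainable, aligns dims
        self.frozen_block = frozen_block             # weights are not updated
        self.proj_out = nn.Linear(llm_dim, vis_dim)  # trainable, aligns dims
        for param in self.frozen_block.parameters():
            param.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, vis_dim), e.g. the output of a ViT
        x = self.proj_in(tokens)
        x = self.frozen_block(x)
        return self.proj_out(x)


torch.manual_seed(0)

# Assumption: a randomly initialised encoder layer stands in for a block
# taken from a pre-trained LLM such as LLaMA or OPT.
stand_in_block = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128, dropout=0.0, batch_first=True
)
encoder = FrozenBlockEncoder(vis_dim=32, llm_dim=64, frozen_block=stand_in_block)

tokens = torch.randn(2, 10, 32)  # 2 images, 10 visual tokens each
out = encoder(tokens)

# Only the two linear layers should remain trainable.
trainable = [name for name, p in encoder.named_parameters() if p.requires_grad]

# Gradients still flow through the frozen block to the input projection.
out.sum().backward()
```

In a real setup the stand-in block would be replaced by a block loaded from a pre-trained LLM checkpoint, and the wrapper would sit between an existing visual encoder and its task decoder.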

Paper Structure (37 sections, 11 equations, 13 figures, 13 tables)


Figures (13)

  • Figure 1: Our straightforward method of using a frozen transformer block from pre-trained LLMs as a visual encoder layer. Visualized with an example of ViT (Dosovitskiy et al., 2020). (a) Our design simply appends a frozen transformer block (pink) on top of the regular visual encoder (gray). Only two trainable linear layers (green) are added to align the feature dimensions. (b) PyTorch-style pseudo-code shows the simplicity of our approach.
  • Figure 2: Various LLM transformer layers improve the accuracy.
  • Figure 3: (a) Feature activation regarding both magnitudes and frequencies of features. We highlight that ViT-LLaMA demonstrates the emergent tendency of object segmentation compared with ViT, indicating its ability to select informative tokens. (b) Attention scores between the $\mathtt{CLS}$ and visual tokens. The attention from ViT is commonly noisy (left). Though ViT-LLaMA improves the concentration on a few heads, most of the attention heads are still noisy. Both good and bad attention maps from ViT-LLaMA are sampled for demonstration purposes.
  • Figure 4: Pseudo-masks from ViT-LLaMA's features ($\mathbf{F}_{L}^2$) have larger mIoU than attention scores and ViT.
  • Figure 5: Token activation in action recognition. Video tokens are activated jointly in all the frames, and every video token is a cube with shape $2\times 16\times 16$. After adding the LLM transformer, the model better concentrates on the relevant objects and hands ("low threshold") and more accurately focuses on frames with hand-object interaction ("high threshold").
  • ...and 8 more figures
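
The captions for Figures 3 and 4 refer to feature activation magnitudes and to pseudo-masks scored by mIoU. A rough sketch of that kind of measurement is given below. It rests on assumptions that this section does not confirm: the magnitude of a token is taken to be the L2 norm of its feature after subtracting the mean feature, the pseudo-mask is a simple fixed threshold on that magnitude, and all function names and toy values are hypothetical.

```python
import math


def activation_magnitudes(features):
    """L2 norm of each token feature after subtracting the mean feature."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for d in range(dim)))
        for f in features
    ]


def pseudo_mask(magnitudes, threshold):
    """Mark a token as foreground when its magnitude exceeds the threshold."""
    return [m > threshold for m in magnitudes]


def iou(pred, target):
    """Intersection over union of two boolean token masks."""
    inter = sum(p and t for p, t in zip(pred, target))
    union = sum(p or t for p, t in zip(pred, target))
    return inter / union if union else 1.0


# Toy example: four tokens with 2-D features; only the third stands out.
features = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [0.0, 0.0]]
mags = activation_magnitudes(features)
mask = pseudo_mask(mags, threshold=2.0)

# Hypothetical ground-truth mask covering the third and fourth tokens.
target = [False, False, True, True]
score = iou(mask, target)
```

Averaging `iou` over a dataset gives an mIoU-style score, which is the kind of comparison the Figure 4 caption describes between ViT-LLaMA features, attention scores, and ViT.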