FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression

Jianjian Li, Junquan Fan, Feng Tang, Gang Huang, Shitao Zhu, Songlin Liu, Nian Xie, Wulong Liu, Yong Liao

TL;DR

FCoT-VL addresses the inefficiency of high-resolution Vision Large Language Models (VLLMs) by compressing visual tokens with a self-distillation framework, in which a full-token teacher (InternVL2) guides a student that operates on compressed visual tokens. Only lightweight components (A_s and V_c) are learned during re-alignment. A post-training stage, which uses high-quality instruction data, chain-of-thought augmentation, and model merging, then recovers the performance lost to compression. Evaluations across nine text-oriented benchmarks show that FCoT-VL achieves competitive or superior results while substantially reducing token counts and accelerating inference, particularly at 50% and 75% compression ratios. The method is practical for resource-constrained deployment of text-focused VLLMs, offering flexible compression modules and data-efficient training strategies.

Abstract

The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a lightweight self-distillation pre-training stage to compress the visual tokens, requiring only a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of token-compressed models, we construct a high-quality post-training stage. To validate the effectiveness of our method, we apply it to an advanced VLLM, InternVL2. Experimental results show that our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks. We will release the models and code soon.

Paper Structure

This paper contains 28 sections, 7 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: Comparison of scores between FastV and FCoT-VL on different types of benchmarks. FastV suffers a significant decline on tasks that require high resolution, such as DocVQA and InfoVQA. In contrast, our method shows only minor performance degradation.
  • Figure 2: Overall Structure of FCoT-VL. FCoT-VL is a self-distillation architecture in which only the Student-Projector and Compress-Module are learned, while all the other modules remain frozen. The student and teacher models share the same ViT encoder and the LLM decoder.
  • Figure 3: Performance percentage across multiple benchmarks under different compression ratios on the InternVL2-2B (left) and InternVL2-8B (right) models.
  • Figure 4: Loss curves of re-alignment pre-training. The loss drops rapidly at first and then converges smoothly over a long period.
  • Figure 5: Model performance changes across intermediate training iterations.
  • ...and 13 more figures
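
To make the 50% and 75% compression ratios from the TL;DR concrete, the following is a minimal sketch of how a compression ratio translates into a visual token count. It is not the paper's Compress-Module: the function name `compress_tokens`, the mean-pooling over consecutive tokens, and the 256-token input are all assumptions chosen purely for illustration.

```python
from typing import List

Token = List[float]


def compress_tokens(tokens: List[Token], ratio: float) -> List[Token]:
    """Mean-pool consecutive tokens so that roughly `ratio` of them are removed.

    Hypothetical stand-in for a learned compression module: a ratio of 0.5
    merges tokens in groups of 2, and a ratio of 0.75 in groups of 4.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    group = round(1.0 / (1.0 - ratio))  # number of tokens merged into one
    pooled: List[Token] = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        # Average each feature dimension across the tokens in this group.
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


if __name__ == "__main__":
    # Assumed input: 256 visual tokens with 2 features each (illustrative only).
    visual_tokens = [[float(i), float(i) * 2.0] for i in range(256)]
    for ratio in (0.5, 0.75):
        compressed = compress_tokens(visual_tokens, ratio)
        print(f"ratio={ratio}: {len(visual_tokens)} -> {len(compressed)} tokens")
```

Because the language model processes every visual token it receives, a shorter token sequence directly reduces the input length it has to handle. In the framework described above, the compression is learned rather than fixed, so the pooling here only illustrates the token-count arithmetic.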