PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai

TL;DR

PVC tackles the fragmentation between image and video processing in large Vision-Language Models by standardizing inputs as videos and applying progressive encoding with adaptive token compression. It unifies token handling across modalities, enabling 64 tokens per frame to effectively preserve spatial details and temporal dynamics through repeated image frames and temporally-aware ViT layers. The approach achieves state-of-the-art results on long-video and fine-grained video benchmarks while maintaining image-task performance, particularly for detail-sensitive tasks. This demonstrates a versatile, data-efficient pathway to robust multi-modal understanding across both images and videos.

Abstract

Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is used to reduce the considerable token length of visual inputs. To meet the needs of different tasks, existing high-performance models usually process images and videos separately with different token compression strategies, which limits their ability to handle images and videos jointly. To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), where the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. Video tokens are compressed efficiently by exploiting the inherent temporal redundancy. Images are repeated as static videos, and their spatial details can be gradually supplemented over multiple frames. PVC thus unifies the token compression of images and videos. With a limited number of tokens per frame (64 tokens by default), spatial details and temporal changes can still be preserved. Experiments show that our model achieves state-of-the-art performance across various video understanding benchmarks, including long video tasks and fine-grained short video tasks. Meanwhile, our unified token compression strategy incurs no performance loss on image benchmarks, particularly in detail-sensitive tasks.
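
The input standardization described in the abstract can be illustrated with a minimal sketch. It only shows how an image might be turned into a "static" video and how the visual token count then scales with the number of frames. The function names `as_video` and `visual_token_count` and the default of 4 repeats are hypothetical choices for illustration, not the authors' code; only the 64-tokens-per-frame budget is taken from the abstract.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")

# Default per-frame token budget stated in the abstract.
TOKENS_PER_FRAME = 64


def as_video(frames: Sequence[Frame], image_repeats: int = 4) -> List[Frame]:
    """Standardize an input as a list of frames.

    A single image (an input of length 1) is repeated `image_repeats` times
    to form a "static" video. A real video is returned unchanged.
    The default of 4 repeats is an assumption made for this sketch.
    """
    if len(frames) == 0:
        raise ValueError("expected at least one frame")
    if image_repeats < 1:
        raise ValueError("image_repeats must be at least 1")
    if len(frames) == 1:
        return [frames[0]] * image_repeats
    return list(frames)


def visual_token_count(num_frames: int,
                       tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Total visual tokens when every frame keeps the same token budget."""
    return num_frames * tokens_per_frame


static_video = as_video(["image"], image_repeats=4)
print(len(static_video), visual_token_count(len(static_video)))  # 4 256
print(visual_token_count(16))  # 1024 tokens for a 16-frame video
```

Under this scheme the per-frame budget is identical for images and videos; an image gains capacity for detail only through additional repeated frames, which matches the abstract's description of spatial details being supplemented over multiple frames.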

Paper Structure

This paper contains 20 sections, 6 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Comparison of token encoding and compression in VLMs. (a) Existing VLMs compress image and video tokens separately, leading to inconsistency: more tokens per image benefit image spatial perception, while videos tend to sacrifice some tokens per frame to accommodate more frames. (b) Our progressive compression (PVC) achieves unified compression of images and videos, allowing for the continuous supplementation of image details and temporal dynamic information in subsequent frames.
  • Figure 2: Network architecture of progressive visual token compression (PVC). The inputs are standardized as videos, with images repeated to form static videos. A causal temporal attention and an AdaLN layer are incorporated into the ViT layers to progressively encode visual tokens across timesteps. The adaptive compression module, based on PixelShuffle, includes an AdaLN layer to reduce redundancy in visual tokens.
  • Figure 3: Analysis of progressive compression. We compare our PVC model with the baseline without progressive compression (setting (b) in the ablation table). For video tasks (MVBench and VideoMME), we test different numbers of input frames. For image tasks (InfoVQA and MMB), we test different numbers of image repetitions.
  • Figure 4: PVC achieves image progressive encoding. The image is repeated once (left) and four times (right). Supplementary contents are marked in blue, incorrect contents in red, and corrected contents in green.
  • Figure 5: PVC effectively captures spatiotemporal dynamics in videos. Correct descriptions of the movements and interactions of the objects are marked in blue, while incorrect descriptions are marked in red. For visualization, we select the above 8 key frames from the video, while the entire video is fed into the models.
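
Two of the components named in the Figure 2 caption can be sketched in a few lines: the causal temporal attention pattern and a PixelShuffle-style token reduction. This is an illustrative sketch under stated assumptions, not the paper's implementation. It assumes that "causal" means frame t attends only to frames s ≤ t, and that the PixelShuffle-based compression behaves like a 2×2 space-to-depth merge, in which four neighboring tokens become one token with four times the channels. The function names, the merge factor of 2, and the channel ordering are all hypothetical, and the AdaLN layers are omitted.

```python
from typing import List

Token = List[float]            # one token = a list of channel values
TokenGrid = List[List[Token]]  # tokens arranged as rows x columns


def causal_temporal_mask(num_frames: int) -> List[List[bool]]:
    """mask[t][s] is True when frame t may attend to frame s (s <= t)."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def merge_2x2(grid: TokenGrid) -> TokenGrid:
    """Space-to-depth merge (assumed reading of the PixelShuffle step).

    Each 2x2 block of tokens is replaced by a single token whose channels
    are the concatenation of the four originals, so the token count drops
    by a factor of 4 while no channel values are discarded.
    """
    height, width = len(grid), len(grid[0])
    if height % 2 or width % 2:
        raise ValueError("grid height and width must be even")
    merged: TokenGrid = []
    for i in range(0, height, 2):
        row: List[Token] = []
        for j in range(0, width, 2):
            row.append(grid[i][j] + grid[i][j + 1]
                       + grid[i + 1][j] + grid[i + 1][j + 1])
        merged.append(row)
    return merged


# A 16x16 grid of 1-channel tokens (256 tokens) becomes 8x8 (64 tokens).
grid = [[[float(16 * i + j)] for j in range(16)] for i in range(16)]
compressed = merge_2x2(grid)
print(len(compressed) * len(compressed[0]), len(compressed[0][0]))  # 64 4
```

In a real model the concatenated channels would normally be projected back to the working width by a learned layer; that projection and the adaptive conditioning are left out of this sketch.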