HybridToken-VLM: Hybrid Token Compression for Vision-Language Models

Jusheng Zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lv, Yijia Fan, Wenhao Chai, Jian Wang, Keze Wang

TL;DR

HTC-VLM tackles the vision–language bottleneck with a disentangled hybrid compression scheme that separates high-level semantic anchors from low-level appearance details. It combines four discrete semantic tokens (from MGVQ quantization) with 576 continuous ViT patch tokens, then compresses the fused 580-token sequence into a single <voco> token under a disentanglement mask, achieving 580-to-1 compression with 87.2% average performance retention across seven benchmarks. Attention analyses show interpretable patterns in which the <voco> token prioritizes the discrete anchors, and ablations show that the four-token discrete setup and the pre-fusion strategy yield the best performance. The work demonstrates that embedding semantic structure before compression enables efficient, faithful multimodal reasoning and offers a scalable path for deploying VLMs with constrained context windows.

Abstract

Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels: a continuous pathway that carries fine-grained details via ViT patches, and a discrete pathway that provides symbolic anchors via MGVQ quantization projected to four tokens. The two are fused into a 580-token hybrid sequence and compressed into a single <voco> token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2% across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0%, with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
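To make the quadratic-cost argument concrete, the following back-of-the-envelope sketch counts pairwise attention scores per layer and head for a full 580-token hybrid sequence versus a single compressed token. The prompt length of 64 text tokens and the function name `attention_pairs` are illustrative assumptions, not values from the paper, and the count ignores the one-time cost of producing the compressed token.

```python
def attention_pairs(n_visual: int, n_text: int) -> int:
    """Number of query-key score entries in dense self-attention
    over a sequence of n_visual + n_text tokens (one layer, one head)."""
    n = n_visual + n_text
    return n * n


N_DISCRETE = 4      # discrete semantic tokens (from the abstract)
N_PATCHES = 576     # continuous ViT patch tokens (from the abstract)
N_TEXT = 64         # hypothetical prompt length, chosen for illustration

n_hybrid = N_DISCRETE + N_PATCHES          # 580-token hybrid sequence
full = attention_pairs(n_hybrid, N_TEXT)   # text attends over all visual tokens
compressed = attention_pairs(1, N_TEXT)    # text attends over one <voco> token

print(f"hybrid tokens:      {n_hybrid}")
print(f"uncompressed pairs: {full}")
print(f"compressed pairs:   {compressed}")
print(f"reduction factor:   {full / compressed:.1f}x")
```

Because the saving scales with the square of the sequence length, the reduction factor depends strongly on the assumed text length: the shorter the prompt, the more the visual tokens dominate the uncompressed cost.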

Paper Structure

This paper contains 50 sections, 16 equations, 4 figures, 12 tables, 3 algorithms.

Figures (4)

  • Figure 1: Vision-token compression. (a) VoCo-LLaMA collapses 576 patches into one <voco> token, losing semantic structure. (b) HTC-VLM adds 4 discrete semantic tokens and compresses all into one <voco> token, preserving semantics and visual detail.
  • Figure 2: Comparison of visual token compression strategies. (a) Pooling Method: visual embeddings are averaged or pooled before being fused with text inputs. (b) VoCo-LLaMA: compresses 576 visual tokens into a single <voco> token. (c) HTC-VLM (ours): introduces a hybrid representation with a continuous channel ($D$) encoding 576 patch embeddings and a discrete channel ($S$) generating 4 semantic tokens via MGVQ. The hybrid sequence $[v_d; V]$ is compressed into a trainable <voco> token under the disentanglement mask $M_{hy}$, producing latent $z$ that preserves both semantics and fine-grained details.
  • Figure 3: Comparison of compression strategies and their effect on visual token attention. Left: Attention heatmap of the <voco> token in HTC-VLM over the 4 discrete semantic tokens plus the first 12 image patch tokens, for 16 test samples from the MME benchmark. Right: Attention heatmap of the <voco> token in the original VoCo-LLaMA model (Ye et al., 2025) over the first 16 image patch tokens, for the same 16 test samples.
  • Figure 4: Performance vs. visual token budget on GQA/VQAv2. HTC-VLM maintains higher accuracy under extreme compression while matching the efficiency of single-token baselines.
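
The captions above describe a hybrid sequence $[v_d; V]$ that is compressed into one <voco> token under a mask $M_{hy}$. The sketch below shows one plausible way such a mask could be built; it is not the authors' implementation. It assumes a sequence layout of [4 discrete tokens; 576 patch tokens; <voco>; text], causal attention, and a rule that text tokens may attend to <voco> but not directly to the visual tokens (a rule modeled on VoCo-LLaMA-style masking). The exact form of $M_{hy}$ in the paper may differ, and the function name `build_hybrid_mask` is hypothetical.

```python
import numpy as np


def build_hybrid_mask(n_discrete: int = 4, n_patches: int = 576, n_text: int = 8):
    """Return (mask, voco_index) for the assumed layout
    [v_d (n_discrete); V (n_patches); <voco>; text (n_text)].

    mask[i, j] is True when token i may attend to token j.
    Assumption: the layout and blocking rule are illustrative only.
    """
    n_hybrid = n_discrete + n_patches
    voco = n_hybrid                      # position of the <voco> token
    total = n_hybrid + 1 + n_text

    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((total, total), dtype=bool))

    # Block text tokens from seeing the hybrid visual tokens directly,
    # so all visual information must flow through <voco>.
    mask[voco + 1:, :n_hybrid] = False
    return mask, voco


mask, voco = build_hybrid_mask()
n_hybrid = voco
print("mask shape:", mask.shape)
print("<voco> sees all hybrid tokens:", bool(mask[voco, :n_hybrid].all()))
print("text sees any hybrid token:", bool(mask[voco + 1:, :n_hybrid].any()))
print("text sees <voco>:", bool(mask[voco + 1:, voco].all()))
```

Under these assumptions, the <voco> row is the only path from the 580 hybrid tokens to the text tokens, which is what makes the single token act as a bottleneck.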