CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs

Jingyu Lei, Gaoang Wang, Der-Horng Lee

TL;DR

CORE tackles the prohibitive cost of LVLMs by introducing object-centric token merging guided by segmentation masks, producing a compact, semantically meaningful set of tokens and restoring spatial order via centroid-based sorting. Built on a shared ConvNeXt-L backbone and Mask2Former segmentation head, CORE merges tokens per object and feeds an LLM through a projection layer, enabling end-to-end efficiency. The approach delivers state-of-the-art performance on six fixed-rate benchmarks and dramatic efficiency gains in adaptive-rate regimes, retaining up to 97.4% of baseline performance with only 2.2% of tokens. This object-centric paradigm preserves semantic and spatial cues, offering robust, scalable processing for LVLMs and enabling applications in retrieval, robotics perception, and surveillance.

Abstract

Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new state of the art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings. Even under extreme compression, retaining only 2.2% of all visual tokens, CORE still maintains 97.4% of baseline performance. Our work demonstrates the superiority of object-centric representations for efficient and effective LVLM processing.
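The mask-guided merging described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact merging rule, so the weighted average used here is an assumption, as are the names `merge_tokens`, `tokens`, `masks`, and `eps`; the sketch only shows how one object mask can collapse many visual tokens into a single object-centric token.

```python
from typing import List

Vector = List[float]


def merge_tokens(tokens: List[Vector], masks: List[List[float]],
                 eps: float = 1e-8) -> List[Vector]:
    """Merge I visual tokens into N tokens, one per object mask.

    ASSUMPTION: each merged token is the mask-weighted average of the
    visual tokens. The paper's actual rule may differ.

    tokens: I vectors of dimension D (flattened patch grid).
    masks:  N rows of I non-negative weights (one row per object).
    """
    dim = len(tokens[0])
    merged = []
    for weights in masks:
        if len(weights) != len(tokens):
            raise ValueError("each mask needs one weight per token")
        total = sum(weights) + eps  # eps guards against an empty mask
        merged.append([
            sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return merged


# Toy input: 4 visual tokens, 2 hard object masks covering 2 tokens each.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
masks = [[1.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]
merged = merge_tokens(tokens, masks)
print(len(tokens), "->", len(merged), "tokens")
```

In this toy case four visual tokens are reduced to two, one per mask, so the number of tokens passed to the language model scales with the number of objects rather than with the patch grid size.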

Paper Structure

This paper contains 25 sections, 1 equation, 23 figures, 7 tables, 2 algorithms.

Figures (23)

  • Figure 1: CORE's Performance and Efficiency. (a) When retaining only 160, 320, and 640 tokens, CORE outperforms current state-of-the-art efficient LVLMs, such as VisionZip [yang2024visionzip] and DivPrune [alvar2025divprune], across six benchmarks. (b) Under its highest compression ratio, CORE reduces FLOPs by 16.0$\times$, KV Cache by 182.3$\times$, and GPU Memory by 2.7$\times$, while still maintaining 97.4% of its baseline performance.
  • Figure 1: Comparison on Fixed-rate Compression Tasks. For fair comparison, the blue percentage values show the performance retained with a fixed number of tokens, relative to the full-token CORE model (ConvNeXt-L backbone), which serves as the 100% baseline.
  • Figure 2: Overview of CORE. Our framework consists of two key pathways. The primary data flow, indicated by solid lines, shows how compact object-centric representations are generated and processed by the language decoder. This process is informed by the auxiliary segmentation head, shown with dashed lines, which produces the object masks that guide the token merging. The icon in the top-left corner of each mask denotes a different object in the image.
  • Figure 3: Illustration of Centroid-Guided Sorting. Assume $N=3$. In Step (a), the number in each token is its index $i$. For simplicity, darker (lighter) tokens represent a weight of 0.9 (0.1) in $P_n$. In Step (b), the number in token $t_n$ indicates its centroid position $c_n$; at this point the tokens are merged but not yet sorted. Step (c) shows the final merged tokens $T$ sorted in ascending order of their centroid values.
  • Figure 4: Visualization Comparison
  • ...and 18 more figures
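
The centroid-guided sorting illustrated in Figure 3 can be sketched as follows. The caption does not define $c_n$, so this sketch assumes it is the $P_n$-weighted mean of the 1-based flattened token indices $i$; the function names, the six-token grid, and the labels "A", "B", "C" are hypothetical and chosen only to mirror the $N=3$ setting with 0.9 and 0.1 weights.

```python
from typing import List, Sequence, Tuple


def centroid(weights: Sequence[float], eps: float = 1e-8) -> float:
    """ASSUMPTION: c_n is the weighted mean of 1-based token indices."""
    total = sum(weights) + eps
    return sum(i * w for i, w in enumerate(weights, start=1)) / total


def sort_by_centroid(merged: Sequence[str],
                     masks: Sequence[Sequence[float]]
                     ) -> Tuple[List[str], List[float]]:
    """Reorder merged tokens by ascending centroid position."""
    centroids = [centroid(w) for w in masks]
    # sorted() is stable, so tokens with equal centroids keep their order.
    order = sorted(range(len(merged)), key=lambda n: centroids[n])
    return [merged[n] for n in order], [centroids[n] for n in order]


# N = 3 masks over 6 tokens; 0.9 marks "dark" tokens, 0.1 "light" ones.
masks = [
    [0.1, 0.1, 0.1, 0.1, 0.9, 0.9],  # object A sits at the end
    [0.9, 0.9, 0.1, 0.1, 0.1, 0.1],  # object B sits at the start
    [0.1, 0.1, 0.9, 0.9, 0.1, 0.1],  # object C sits in the middle
]
merged = ["A", "B", "C"]  # placeholder labels standing in for merged tokens
sorted_tokens, sorted_centroids = sort_by_centroid(merged, masks)
print(sorted_tokens)
```

Under these assumptions the merged tokens come out in the order B, C, A, which matches the left-to-right position of each object's high-weight tokens and so restores a raster-like spatial order for the language model.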