Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning

Bizhe Bai, Jianjian Cao, Yadan Luo, Tao Chen

TL;DR

This work tackles the high computational cost of Grounded Conversation Generation (GCG) by preserving local visual information during token pruning. It introduces Adaptive Local-Aware Token Pruning (ALTP), which combines Detail Density Capture (DDC) with Dynamic Density Formation (DDF) to retain tokens in each region in proportion to its information density. Experiments on GranDf with GLaMM and OMG-LLaVA show that ALTP consistently outperforms FastV and PyramidDrop at up to a 90% reduction in visual tokens, with higher AP50, Recall, and mIOU than those baselines. The results indicate that local object details are crucial for grounded conversation generation, and that preserving them allows faster inference with a smaller loss in grounding accuracy and segmentation quality than existing pruning methods.

Abstract

Grounded Conversation Generation (GCG) is an emerging vision-language task that requires models to generate natural language responses seamlessly intertwined with corresponding object segmentation masks. Recent models, such as GLaMM and OMG-LLaVA, achieve pixel-level grounding but incur significant computational costs due to processing a large number of visual tokens. Existing token pruning methods, like FastV and PyramidDrop, fail to preserve the local visual features critical for accurate grounding, leading to substantial performance drops in GCG tasks. To address this, we propose Adaptive Local-Aware Token Pruning (ALTP), a simple yet effective framework that accelerates GCG models by prioritizing local object information. ALTP introduces two key components: (1) Detail Density Capture (DDC), which uses superpixel segmentation to retain tokens in object-centric regions, preserving fine-grained details, and (2) Dynamic Density Formation (DDF), which dynamically allocates tokens based on information density, ensuring higher retention in semantically rich areas. Extensive experiments on the GranDf dataset demonstrate that ALTP significantly outperforms existing token pruning methods, such as FastV and PyramidDrop, on both GLaMM and OMG-LLaVA models. Notably, when applied to GLaMM, ALTP achieves a 90% reduction in visual tokens with a 4.9% improvement in AP50 and a 5.0% improvement in Recall compared to PyramidDrop. Similarly, on OMG-LLaVA, ALTP improves AP by 2.1% and mIOU by 3.0% at a 90% token reduction compared with PyramidDrop.
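The region-wise retention idea behind DDC can be illustrated with a minimal sketch. It assumes that a superpixel id is already available for every visual token and that tokens inside a region are picked at an even stride; the function name `retain_tokens_per_region`, the input layout, and the strided selection rule are all hypothetical choices made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def retain_tokens_per_region(region_of_token, keep_ratio):
    """Return the sorted indices of the visual tokens that are kept.

    region_of_token[i] is the (hypothetical) superpixel id of token i.
    Every region keeps roughly keep_ratio of its tokens and never fewer
    than one, so no region disappears entirely after pruning.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Group token indices by the region they fall into.
    members = defaultdict(list)
    for idx, region in enumerate(region_of_token):
        members[region].append(idx)

    kept = []
    for idxs in members.values():
        n_keep = max(1, round(len(idxs) * keep_ratio))
        # Assumption: evenly strided picks inside the region.
        step = len(idxs) / n_keep
        kept.extend(idxs[int(k * step)] for k in range(n_keep))
    return sorted(kept)


# 16 tokens in three regions, keeping about 25% of each region.
labels = [0] * 8 + [1] * 4 + [2] * 4
kept_indices = retain_tokens_per_region(labels, keep_ratio=0.25)
```

In contrast to a purely global ranking, which can discard every token of a small object, this per-region rule guarantees that each region contributes at least one token to the pruned sequence.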

Paper Structure

This paper contains 21 sections, 6 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) Result of the original GLaMM [rasheed2024glammpixelgroundinglarge]. (b) Result of GLaMM with FastV [chen2024imageworth12tokensfastv] pruning 90% of visual tokens starting from the second layer. (c) Result of GLaMM with FastV pruning 75% of visual tokens starting from the second layer. (d) Result of GLaMM with FastV [chen2024imageworth12tokensfastv] pruning 90% of visual tokens starting from the second layer while retaining additional visual tokens at the curtain location. Comparing (c) and (d), we conclude that "Local Information Matters": preserving visual tokens corresponding to object locations provides richer object information to the vision-language model.
  • Figure 2: (a) Result of the original OMG-LLaVA [zhang2024omgllavabridgingimagelevelobjectlevel]. (b) Result of OMG-LLaVA with FastV [chen2024imageworth12tokensfastv] pruning 90% of visual tokens starting from the second layer. (c) Result of pruning 75% of visual tokens starting from the second layer. (d) Result of pruning 90% of visual tokens while retaining additional visual tokens at the tree location. Comparing (c) and (d), we reach the same conclusion as in Figure 1: "Local Information Matters".
  • Figure 3: Overview of the proposed Adaptive Local-Aware Token Pruning (ALTP) framework. It comprises two main components: the Detail Density Capture (DDC) module and the Dynamic Density Formation (DDF) module. The DDC module segments the image into semantically coherent sub-areas, ensuring that a larger proportion of the tokens located in detail-dense regions is retained for precise pixel-level grounding. Meanwhile, the DDF module dynamically adjusts token allocation within each region based on information density, yielding an adaptive pruning strategy that ensures higher retention for tokens rich in detail.
  • Figure 4: Detail Density Capture (DDC) visualization. Left: Retained token locations using DDC with a 75% token drop. Right: Grounded conversation generation result using DDC, demonstrating successful generation of the "wall" phrase and mask.
  • Figure 5: Visualization of Dynamic Density Formation (DDF) token allocation. Left: Pixel variance for each sub-area calculated using Equation \ref{eq:ddf_density}, indicating higher information density in regions like the curtain. Middle: Corresponding token allocation weights derived from information density via Equation \ref{eq:ddf_weight}, showing that regions with higher variance (e.g., the curtain) receive a larger allocation budget. Right: Final generation results with the DDC and DDF modules under a 25% token allocation setting. Compared to the uniform token allocation in DDC shown in Figure 4, the DDF module enables the model to dynamically allocate the token budget to the regions with higher information density and therefore to capture and generate detailed objects (e.g., the curtain).
  • ...and 1 more figure
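The density-to-budget step described in the Figure 5 caption can be sketched as follows. The paper's density and weight equations are not reproduced in this excerpt, so the sketch assumes the simplest reading: density is the population variance of a sub-area's pixel intensities, and weights are the variances normalized to sum to one. The function name `allocate_token_budget`, the dictionary input, and the largest-remainder rounding are hypothetical and chosen only for illustration.

```python
from statistics import pvariance


def allocate_token_budget(region_pixels, total_budget):
    """Split total_budget tokens across regions in proportion to variance.

    region_pixels maps a region id to a list of pixel intensities.
    Returns (weights, budget); the budgets always sum to total_budget.
    """
    # Assumed density measure: population variance of pixel intensities.
    variances = {r: pvariance(p) for r, p in region_pixels.items()}
    total_var = sum(variances.values())

    if total_var == 0:
        # Completely flat image: fall back to uniform weights.
        weights = {r: 1.0 / len(variances) for r in variances}
    else:
        weights = {r: v / total_var for r, v in variances.items()}

    # Largest-remainder rounding keeps the integer budgets exact.
    raw = {r: w * total_budget for r, w in weights.items()}
    budget = {r: int(x) for r, x in raw.items()}
    leftover = total_budget - sum(budget.values())
    by_remainder = sorted(raw, key=lambda r: raw[r] - budget[r], reverse=True)
    for r in by_remainder[:leftover]:
        budget[r] += 1
    return weights, budget


# Toy sub-areas: a textured curtain, a mildly varying wall, a flat sky.
regions = {
    "wall": [10, 30, 10, 30],
    "curtain": [0, 20, 40, 60],
    "sky": [5, 5, 5, 5],
}
weights, budget = allocate_token_budget(regions, total_budget=12)
```

With these toy values the curtain receives most of the budget and the flat sky receives none. A practical variant would likely enforce a minimum number of tokens per region so that low-variance regions remain represented; that floor is omitted here to keep the sketch short.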