Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning

Bizhe Bai; Jianjian Cao; Yadan Luo; Tao Chen

TL;DR

This work tackles the high computational cost of Grounded Conversation Generation (GCG) by preserving local visual information during token pruning. It introduces Adaptive Local-Aware Token Pruning (ALTP), which combines Detail Density Capture (DDC) with Dynamic Density Formation (DDF) to retain tokens in each region in proportion to that region's information density. In experiments on GranDf with GLaMM and OMG-LLaVA, ALTP consistently outperforms FastV and PyramidDrop: at a 90% reduction in visual tokens, it improves AP50 and Recall on GLaMM, and AP and mIOU on OMG-LLaVA, relative to PyramidDrop. The results indicate that local object details are crucial for grounded conversation generation, and that preserving them allows faster inference with less loss of grounding accuracy and segmentation quality than prior pruning methods.

Abstract

Grounded Conversation Generation (GCG) is an emerging vision-language task that requires models to generate natural language responses seamlessly intertwined with corresponding object segmentation masks. Recent models, such as GLaMM and OMG-LLaVA, achieve pixel-level grounding but incur significant computational costs because they process a large number of visual tokens. Existing token pruning methods, like FastV and PyramidDrop, fail to preserve the local visual features critical for accurate grounding, leading to substantial performance drops in GCG tasks. To address this, we propose Adaptive Local-Aware Token Pruning (ALTP), a simple yet effective framework that accelerates GCG models by prioritizing local object information. ALTP introduces two key components: (1) Detail Density Capture (DDC), which uses superpixel segmentation to retain tokens in object-centric regions, preserving fine-grained details, and (2) Dynamic Density Formation (DDF), which dynamically allocates tokens based on information density, ensuring higher retention in semantically rich areas. Extensive experiments on the GranDf dataset demonstrate that ALTP significantly outperforms these existing token pruning methods on both GLaMM and OMG-LLaVA. Notably, when applied to GLaMM, ALTP achieves a 90% reduction in visual tokens with a 4.9% improvement in AP50 and a 5.0% improvement in Recall compared to PyramidDrop. Similarly, on OMG-LLaVA, ALTP improves AP by 2.1% and mIOU by 3.0% at a 90% token reduction compared with PyramidDrop.
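
The idea of retaining tokens per region in proportion to information density can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how density is measured or how superpixels are computed, so the sketch assumes that each visual token already carries a region label (for example from a superpixel segmentation) and a non-negative importance score, and it uses the sum of scores in a region as a stand-in for density. The names `allocate_budget`, `prune_tokens`, `region_ids`, `scores`, and `keep_ratio` are hypothetical.

```python
from collections import defaultdict


def allocate_budget(density, total_keep):
    """Split total_keep across regions in proportion to their density.

    Uses largest-remainder rounding so the budgets sum to total_keep.
    `density` maps region id -> non-negative density (assumed proxy).
    """
    total = sum(density.values())
    if total <= 0:
        raise ValueError("total density must be positive")
    raw = {r: total_keep * d / total for r, d in density.items()}
    budget = {r: int(raw[r]) for r in raw}  # floor of each share
    leftover = total_keep - sum(budget.values())
    # Hand the remaining slots to the regions with the largest remainders.
    by_remainder = sorted(raw, key=lambda r: raw[r] - budget[r], reverse=True)
    for r in by_remainder[:leftover]:
        budget[r] += 1
    return budget


def prune_tokens(region_ids, scores, keep_ratio):
    """Return the sorted indices of the tokens to keep.

    region_ids[i]: region label of token i (assumed to come from superpixels).
    scores[i]:     importance score of token i (hypothetical; higher is kept first).
    keep_ratio:    fraction of tokens to retain, e.g. 0.1 for a 90% reduction.
    """
    total_keep = max(1, round(len(region_ids) * keep_ratio))

    members = defaultdict(list)
    for idx, region in enumerate(region_ids):
        members[region].append(idx)

    # Assumed density proxy: sum of token scores inside each region.
    density = {r: sum(scores[i] for i in idxs) for r, idxs in members.items()}
    budget = allocate_budget(density, total_keep)

    kept = []
    for region, idxs in members.items():
        # A region cannot keep more tokens than it contains, so the final
        # count may fall below total_keep when a budget exceeds region size.
        k = min(budget[region], len(idxs))
        ranked = sorted(idxs, key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:k])
    return sorted(kept)


if __name__ == "__main__":
    region_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    scores = [9, 8, 1, 2, 3, 1, 1, 1, 1, 1]
    print(prune_tokens(region_ids, scores, keep_ratio=0.4))
```

In this toy input, region 0 holds most of the total score, so it receives three of the four retained slots even though it has fewer tokens than region 1. A uniform or purely global top-k selection would not guarantee that every region keeps some representation, which is the contrast the abstract draws with FastV and PyramidDrop.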
