HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models
Liheng Zhang, Jin Wang, Hui Li, Bingfeng Zhang, Weifeng Liu
TL;DR
This work targets the high computational cost of 3D vision-language models by addressing the LLM bottleneck caused by processing numerous 3D tokens. It introduces Hierarchical Compensatory Compression (HCC-3D), a dual-path architecture with Global Structure Compression (GSC) and Adaptive Detail Mining (ADM) that aggressively reduces 3D tokens while preserving essential geometry and salient local details. GSC summarizes global geometry with a small set of tokens via learnable 3D spatial queries and multi-head attention, while ADM detects under-attended but informative regions using an attention-guided score and complements this with detail queries to recover critical information. Experiments on ModelNet40 and Objaverse show state-of-the-art accuracy with dramatically fewer tokens and faster training, demonstrating practical gains for scalable 3D vision-language systems.
Abstract
3D understanding has drawn significant attention recently, with Vision-Language Models (VLMs) enabling multi-modal reasoning over point cloud and text data. Current 3D-VLMs embed 3D point clouds directly into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework incurs a high computational cost that limits its application, and we identify the bottleneck as the processing of all 3D tokens in the Large Language Model (LLM). This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D), which efficiently compresses 3D tokens while retaining critical details. Specifically, we first propose global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while preserving overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves an extreme compression ratio (approximately 98%) compared to previous 3D-VLMs, but also sets a new state of the art, improving both efficiency and performance.
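The two-path idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: a single-head attention without learned projections stands in for the multi-head attention, random vectors stand in for the learned global and detail queries, the complementary score (low received attention weighted by feature norm) is one plausible reading of "salient but under-attended", and the token counts (512 in, 8 + 4 out) are hypothetical rather than taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, tokens):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned Q/K/V projections; a simplified stand-in
    for the multi-head attention mentioned in the text.
    Returns (compressed tokens, attention weights of shape (Q, N)).
    """
    d = queries.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d), axis=-1)
    return attn @ tokens, attn


def compress(tokens, global_queries, detail_queries, top_k):
    """Hypothetical global + detail compression of N tokens."""
    # Global path: a few queries summarize all N tokens.
    global_out, attn = cross_attend(global_queries, tokens)

    # Attention mass each input token received from the global queries.
    received = attn.sum(axis=0)

    # Assumed complementary score: high when a token received little
    # attention but has a large feature norm (a crude saliency proxy).
    saliency = np.linalg.norm(tokens, axis=1)
    score = (1.0 - received / received.max()) * saliency

    # Detail path: re-compress only the top_k highest-scoring tokens.
    idx = np.argsort(score)[-top_k:]
    detail_out, _ = cross_attend(detail_queries, tokens[idx])

    return np.concatenate([global_out, detail_out], axis=0), idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((512, 64))         # hypothetical 3D tokens
    global_queries = rng.standard_normal((8, 64))   # stand-in for learned queries
    detail_queries = rng.standard_normal((4, 64))   # stand-in for learned queries

    out, idx = compress(tokens, global_queries, detail_queries, top_k=32)
    reduction = 1.0 - out.shape[0] / tokens.shape[0]
    print(out.shape, f"{reduction:.1%}")
```

With these illustrative sizes, 512 input tokens become 12 output tokens, a reduction of about 97.7%, which is in the same range as the approximately 98% ratio reported above. The LLM then processes only the concatenated global and detail tokens.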
