HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models

Liheng Zhang, Jin Wang, Hui Li, Bingfeng Zhang, Weifeng Liu

TL;DR

This work targets the high computational cost of 3D vision-language models by addressing the LLM bottleneck caused by processing numerous 3D tokens. It introduces Hierarchical Compensatory Compression (HCC-3D), a dual-path architecture with Global Structure Compression (GSC) and Adaptive Detail Mining (ADM) that aggressively reduces 3D tokens while preserving essential geometry and salient local details. GSC summarizes global geometry with a small set of tokens via learnable 3D spatial queries and multi-head attention, while ADM detects under-attended but informative regions using an attention-guided score and complements this with detail queries to recover critical information. Experiments on ModelNet40 and Objaverse show state-of-the-art accuracy with dramatically fewer tokens and faster training, demonstrating practical gains for scalable 3D vision-language systems.

Abstract

3D understanding has recently drawn significant attention, with Vision-Language Models (VLMs) enabling multi-modal reasoning over point cloud and text data. Current 3D-VLMs directly embed 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework incurs a high computational cost that limits its application, and we identify the bottleneck as the processing of all 3D tokens in the Large Language Model (LLM). This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D), which efficiently compresses 3D tokens while retaining critical details. Specifically, we first propose Global Structure Compression (GSC), in which global queries compress all 3D tokens into a few key tokens while keeping the overall structural information. Then, to compensate for the information loss in GSC, we propose an Adaptive Detail Mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves an extreme compression ratio (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, improving both efficiency and performance.
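To make the two-path idea concrete, here is a minimal pure-Python sketch of query-based compression followed by a compensatory detail pass. It is an illustration under stated assumptions, not the authors' implementation: it uses single-head attention instead of multi-head attention, random vectors stand in for learned queries and encoder outputs, and the 8 + 4 query split, the `pool_size` parameter, and the scoring rule (feature norm times one minus normalized received attention) are all hypothetical choices. Only the 513-to-12 token counts come from the Figure 2 caption.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Single-head scaled dot-product attention; tokens act as keys and values.

    Returns (outputs, weights): one output vector and one weight row per query.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = [
        softmax([scale * sum(a * b for a, b in zip(q, t)) for t in tokens])
        for q in queries
    ]
    outputs = [
        [sum(w * t[j] for w, t in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


def compress(tokens, global_queries, detail_queries, pool_size=64):
    """Return (compressed tokens, global attention weights)."""
    # Global path: a few queries summarize every input token.
    global_tokens, attn = cross_attention(global_queries, tokens)

    # Attention each input token received, averaged over the global queries.
    received = [
        sum(row[i] for row in attn) / len(attn) for i in range(len(tokens))
    ]
    max_received = max(received)

    # ASSUMED complementary score (not from the paper): large feature norm
    # combined with little received attention ranks a token higher.
    scores = [
        math.sqrt(sum(x * x for x in t)) * (1.0 - r / max_received)
        for t, r in zip(tokens, received)
    ]
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    pool = [tokens[i] for i in top[:pool_size]]

    # Detail path: detail queries attend only to the under-attended pool.
    detail_tokens, _ = cross_attention(detail_queries, pool)
    return global_tokens + detail_tokens, attn


rng = random.Random(0)
DIM = 16  # toy feature width, chosen only to keep the demo fast


def random_vectors(count):
    """Random stand-ins for encoder outputs or learned query parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


tokens = random_vectors(513)        # token count from the Figure 2 caption
global_queries = random_vectors(8)  # hypothetical split: 8 global queries
detail_queries = random_vectors(4)  # plus 4 detail queries = 12 output tokens
compressed, attn = compress(tokens, global_queries, detail_queries)
print(len(tokens), "->", len(compressed))  # 513 -> 12
```

In this sketch the output length depends only on the number of queries, so the LLM would see a fixed number of 3D tokens regardless of how many tokens the encoder emits.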

Paper Structure

This paper contains 23 sections, 10 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Performance comparison of 3D point cloud tokenization methods. (a) 3D token compression: HCC-3D uses 12 tokens vs. 500+ in existing methods. (b) Token count versus classification accuracy: HCC-3D uses fewer 3D tokens yet achieves higher performance. (c) Proportion of inference time: the LLM part of current 3D-VLMs accounts for over 90% of the computing cost. Best viewed in color.
  • Figure 2: Overall architecture of HCC-3D. Left: HCC-3D compresses the 513 tokens output by the point cloud encoder into 12 tokens. Right: (a) Global Structure Compression (GSC) compresses voxel features into global features and outputs global attention weights through a multi-head attention mechanism. (b) Adaptive Detail Mining (ADM) selects complementary features by leveraging attention weights and intrinsic feature importance.
  • Figure 3: Qualitative results on 3D object understanding tasks. (a) 3D object recognition: comparison between different VLMs on identifying objects (guitar and sofa) from point cloud inputs. (b) 3D question answering: examples showing model responses to questions about 3D object properties including type, color, and material composition.
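
The "approximately 98%" figure can be checked against the token counts in the Figure 2 caption. The short sketch below does that arithmetic; the helper name `token_reduction` is hypothetical, and the ratio covers only the 3D tokens, not any text tokens that also enter the LLM.

```python
def token_reduction(tokens_before: int, tokens_after: int) -> float:
    """Fraction of tokens removed by compression."""
    if tokens_before <= 0 or not 0 <= tokens_after <= tokens_before:
        raise ValueError("expected 0 <= tokens_after <= tokens_before, tokens_before > 0")
    return 1.0 - tokens_after / tokens_before


# 513 encoder tokens compressed to 12, as stated in the Figure 2 caption.
reduction = token_reduction(513, 12)
print(f"{reduction:.1%}")  # 97.7%
```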