
Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

Yuhui Lin, Siyue Yu, Yuxing Yang, Guangliang Cheng, Jimin Xiao

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of their input features introduce considerable inference overhead, limiting practical deployment on resource-constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which accounts for the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, these components enable context-aware token reduction that retains essential semantics at lower computational cost. Comprehensive experiments on five representative 3D vision-and-language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, including a +2.57% CIDEr improvement on the Scan2Cap dataset. Efficient3D thus provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: https://github.com/sol924/Efficient3D
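To make the debiased importance estimation concrete, below is a minimal sketch (our illustration, not the paper's released implementation) of attention aggregation that keeps a contribution from shallow layers instead of relying only on deep ones; the tensor shapes, the half-and-half layer split, and the shallow_weight blend are all illustrative assumptions.

import torch

def debiased_token_importance(attn_maps, shallow_weight=0.5):
    # attn_maps: list of [num_text_tokens, num_visual_tokens] attention
    # tensors, one per transformer layer (assumes >= 2 layers).
    # Per-layer importance: mean attention each visual token receives.
    per_layer = torch.stack([a.mean(dim=0) for a in attn_maps])  # [L, V]
    split = len(attn_maps) // 2
    shallow = per_layer[:split].mean(dim=0)  # shallow-half average
    deep = per_layer[split:].mean(dim=0)     # deep-half average
    # Blend the two halves so shallow layers are not ignored, which is
    # the bias the DVTIE module is designed to correct.
    return shallow_weight * shallow + (1.0 - shallow_weight) * deep

# Example: 24 layers, 8 text tokens attending over 256 visual tokens.
attn_maps = [torch.rand(8, 256).softmax(dim=-1) for _ in range(24)]
scores = debiased_token_importance(attn_maps)  # [256] importance scores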

Paper Structure

This paper contains 14 sections, 11 equations, 3 figures, 9 tables, and 1 algorithm.

Figures (3)

  • Figure 1: (a) shows that the initial layers of the 3D MLLM allocate more attention to visual tokens, measured on ScanRefer with LLaVA-1.5-7B. (b) compares global prediction and DVTIE in terms of correctly pruned visual tokens. (c) shows visual token pruning under simple and complex questions: simple questions typically involve a single object, whereas complex questions involve multiple objects.
  • Figure 2: Overview of the Efficient3D framework. First, we perform unpruned training on a pretrained 3D MLLM and extract importance scores for visual tokens. Next, we use these scores as the supervision targets for training the proposed DVTIE. During inference, the 3D MLLM uses the importance scores predicted by DVTIE to prune visual tokens. Furthermore, we propose an ATR strategy that adaptively adjusts the pruning ratio according to scene complexity (see the sketch after this list).
  • Figure 3: Visualization of the DVTIE network under different visual token pruning ratios. Results are shown at average pruning ratios of 35%, 65%, and 90%, respectively. Retained objects are marked in orange; red boxes highlight key objects referenced in the prompts.
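
To complement the pipeline in Figure 2, the following is a minimal sketch of complexity-adaptive pruning at inference time. Using the entropy of the importance distribution as a scene-complexity proxy, the base_keep/max_keep bounds, and top-k selection are our assumptions for illustration; the paper's exact ATR criterion may differ.

import torch

def adaptive_keep_ratio(importance, base_keep=0.1, max_keep=0.65):
    # Higher entropy of the importance distribution suggests attention is
    # spread over many objects (a complex scene), so keep more tokens.
    p = torch.softmax(importance, dim=0)
    entropy = -(p * (p + 1e-9).log()).sum()
    max_entropy = torch.log(torch.tensor(float(len(importance))))
    complexity = (entropy / max_entropy).clamp(0.0, 1.0).item()
    return base_keep + (max_keep - base_keep) * complexity

def prune_visual_tokens(visual_tokens, importance):
    # Keep the top-k visual tokens by predicted importance, in their
    # original order, and drop the rest before the LLM forward pass.
    keep_ratio = adaptive_keep_ratio(importance)
    k = max(1, int(len(importance) * keep_ratio))
    idx = importance.topk(k).indices.sort().values
    return visual_tokens[idx]

# Example: 256 visual tokens with 4096-d features.
tokens = torch.randn(256, 4096)
scores = torch.rand(256)
kept = prune_visual_tokens(tokens, scores)

The default keep bounds mirror Figure 3's reported range: average pruning ratios from 35% to 90% correspond to keep ratios between roughly 0.65 and 0.10.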