SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

Wenli Li, Kai Zhao, Haoran Jiang, Enquan Yang, Yi Su, Dan Zeng

Abstract

Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.
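
As a rough illustration of the first stage, the Saliency-aware Token Selector amounts to keeping the top-$k$ visual tokens under an attention-derived importance score. The sketch below is a minimal Python/PyTorch rendering of that idea; the per-token score `attn` (e.g., attention a token receives from a [CLS] or text query) and all tensor shapes are our assumptions, not the paper's exact specification.

```python
import torch

def select_salient_tokens(tokens, attn, k):
    """Keep the k visual tokens with the highest attention-based importance.

    tokens: (N, D) visual token features from the encoder.
    attn:   (N,) per-token importance score, e.g. attention received from
            a [CLS] or text query (assumption: the paper may aggregate
            attention differently).
    k:      number of salient tokens to retain.
    """
    _, idx = torch.topk(attn, k)                    # highest-scoring tokens
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[idx] = False                               # True = still a candidate
    return tokens[idx], tokens[mask], mask          # salient, remaining, mask
```

The remaining (masked-in) tokens are what the geometry-aware stage operates on, so the selector returns them alongside the retained set.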

Paper Structure

This paper contains 26 sections, 11 equations, 6 figures, 6 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Comparison between 2D visual token pruning and spatially-aware pruning for 3D QA. The visualization overlays retained visual tokens on the input image, where transparent regions indicate retained tokens and masked regions denote discarded ones. Conventional 2D pruning methods (top) allocate excessive tokens to background regions due to the lack of spatial awareness. In contrast, integrating spatial information (bottom) encourages more object-centric attention and more uniform sampling in 3D space, preserving richer and more comprehensive scene content for downstream reasoning.
  • Figure 2: Overall framework of our method. Our module is inserted between the visual encoder and the LLM and consists of three components: (1) 3D-Aware Feature Construction, (2) Salient Token Selection, and (3) Diverse Token Selection. Initially, the visual encoder extracts 2D visual tokens from multi-view inputs, where darker colors indicate higher attention scores. We select the top-$k$ tokens as important tokens. To preserve spatial context, the remaining tokens are back-projected into 3D space using the corresponding depth map and processed by the Geometry-aware Token Diversifier to obtain diverse tokens. In the Geometry-aware Token Diversifier, tokens sharing the same shape originate from the same input view. Dashed outlines denote tokens discarded during selection. Finally, the important and diverse tokens are concatenated and fed into the LLM for cross-modal reasoning. A minimal back-projection sketch is given after this figure list.
  • Figure 3: Illustration of Geometry-aware Token Diversifier in 3D space. The outer cube represents the 3D scene space, and geometric primitives denote visual tokens extracted from a single image. We first initialize the selected set $\mathcal{D}$ with the highest-attention token $r_0$. Subsequently, we iteratively calculate the minimum fusion distance from the remaining tokens to $\mathcal{D}$ and select the farthest candidate $r_1$ to update the set. This strategy effectively reduces redundancy by discouraging tokens that are spatially or semantically close to those already selected. A code sketch of this selection loop also follows the figure list.
  • Figure 4: Qualitative ablation of token selection strategies. (a) Full tokens without reduction. (b) Saliency-only selection (Saliency-aware Token Selector) preserves salient objects but may miss fine-grained details such as cables. (c) Diversity-only selection (Geometry-aware Token Diversifier) improves spatial coverage but can overlook semantically important regions on the tabletop. (d) Our method combines saliency and spatial diversity, retaining salient objects and fine-grained details while producing more continuous and complete structures for large objects such as the table.
  • Figure 5: Inference time comparison between VisPruner [vispruner] and ours on the ScanQA [scanqa] validation set. Our method achieves lower latency per example at all token retention ratios.
  • ...and 1 more figure
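
To make the 3D-aware feature construction step in Figure 2 concrete, the following sketch lifts 2D token centers to 3D with the pinhole camera model, assigning each token the depth at its patch center. The intrinsics `K`, the patch-center coordinates `uv`, and the optional camera-to-world transform `T_cw` are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def backproject_tokens(depth, uv, K, T_cw=None):
    """Lift 2D token centers to 3D points via the pinhole camera model.

    depth: (H, W) depth map aligned with the view, in meters.
    uv:    (N, 2) pixel coordinates (u, v) of token/patch centers.
    K:     (3, 3) intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    T_cw:  optional (4, 4) camera-to-world transform, assuming per-view
           poses are available for multi-view fusion.
    """
    u, v = uv[:, 0], uv[:, 1]
    z = depth[v.long(), u.long()]            # per-token depth
    x = (u - K[0, 2]) * z / K[0, 0]          # x = (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]          # y = (v - cy) * z / fy
    pts = torch.stack([x, y, z], dim=-1)     # (N, 3) in the camera frame
    if T_cw is not None:                     # move points to the world frame
        pts = pts @ T_cw[:3, :3].T + T_cw[:3, 3]
    return pts
```

The selection loop of the Geometry-aware Token Diversifier in Figure 3 is farthest-point sampling under a fused distance. A minimal sketch, assuming the fused distance is a weighted sum of 3D Euclidean distance and cosine feature distance with a mixing weight `lam` (the exact fusion rule is our assumption):

```python
import torch

def diversify_tokens(feats, pts, attn, m, lam=0.5):
    """Farthest-point-style selection under a semantic-geometric distance.

    feats: (N, D) token features; normalized here for cosine distance.
    pts:   (N, 3) back-projected 3D token positions.
    attn:  (N,) attention scores; the highest-attention token seeds the set.
    m:     number of diverse tokens to select.
    lam:   geometric/semantic mixing weight (assumption, not from the paper).
    """
    feats = torch.nn.functional.normalize(feats, dim=-1)
    selected = [attn.argmax().item()]                    # seed r_0 (cf. Figure 3)
    min_d = torch.full((feats.size(0),), float("inf"))  # min distance to the set
    for _ in range(m - 1):
        j = selected[-1]
        geo = (pts - pts[j]).norm(dim=-1)                # 3D Euclidean distance
        sem = 1.0 - feats @ feats[j]                     # cosine distance
        min_d = torch.minimum(min_d, lam * geo + (1.0 - lam) * sem)
        min_d[selected] = -1.0                           # never re-pick a token
        selected.append(min_d.argmax().item())           # farthest candidate
    return torch.tensor(selected)
```

Seeding with the highest-attention token and then maximizing the minimum fused distance is what discourages candidates that are spatially or semantically close to tokens already in $\mathcal{D}$, matching the behavior described in the Figure 3 caption.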