On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation
Liyao Tang, Zhe Chen, Dacheng Tao
TL;DR
This work targets the high cost of adapting large pre-trained 3D point-cloud transformers to downstream tasks. It introduces the Geometric Encoding Mixer (GEM), a geometry-aware PEFT module that combines a Spatial Adapter, which refines local geometry, with a Context Adapter, which uses a small set of latent tokens to inject global scene context. GEM matches or exceeds the performance of full fine-tuning while updating only about 1.6% of the parameters, and it shows strong data efficiency, decoder compatibility, and applicability across indoor and outdoor datasets. These results underscore the value of explicitly modeling geometric cues and global context when adapting large-scale 3D vision models efficiently.
Abstract
The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, which incurs high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, although successful in natural language processing and 2D vision tasks, underperform when naively applied to 3D point cloud models because of significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting the local spatial structure and global geometric context that are important in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module designed specifically for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism that captures global context, thereby addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to, and sometimes exceeding, that of full fine-tuning while updating only 1.6% of the model's parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code is available at https://github.com/LiyaoTang/GEM.
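To make the idea of a lightweight latent attention mechanism concrete, the following is a minimal NumPy sketch of how a small set of latent tokens might summarize a scene and feed that summary back to every point. It is an illustration under stated assumptions, not the authors' implementation: the function names, the single-head attention, the bottleneck width `rank`, the number of latents, and the zero-initialized output projection are all hypothetical choices made here and are not taken from the GEM code release.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory, w_q, w_k, w_v):
    """Single-head cross-attention: each query row reads from all memory rows."""
    q, k, v = queries @ w_q, memory @ w_k, memory @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def latent_context_adapter(point_feats, latents, p):
    """Hypothetical context adapter (assumed design, not the GEM source)."""
    # 1) Gather: M latent tokens attend over all N points (M x N scores).
    context = latents + cross_attention(latents, point_feats, *p["gather"])
    # 2) Scatter: every point attends over the M latents (N x M scores).
    update = cross_attention(point_feats, context, *p["scatter"])
    # 3) Project back to the feature width and add as a residual.
    return point_feats + update @ p["w_out"]

rng = np.random.default_rng(0)
n_points, n_latents, dim, rank = 1024, 8, 32, 16  # illustrative sizes only

def init(*shape):
    return rng.normal(scale=0.02, size=shape)

params = {
    "gather": (init(rank, rank), init(dim, rank), init(dim, rank)),
    "scatter": (init(dim, rank), init(rank, rank), init(rank, rank)),
    # Assumed zero init, so the adapter starts as an identity mapping.
    "w_out": np.zeros((rank, dim)),
}
latents = init(n_latents, rank)                  # trainable latent tokens
point_feats = rng.normal(size=(n_points, dim))   # stand-in for frozen backbone features

out = latent_context_adapter(point_feats, latents, params)
n_adapter_params = (
    latents.size
    + params["w_out"].size
    + sum(w.size for key in ("gather", "scatter") for w in params[key])
)
```

Both attention steps cost on the order of N x M rather than N x N, which is why a handful of latent tokens can carry scene-level context cheaply. In a fine-tuning setting of this kind, only `latents` and the entries of `params` would be trained while the backbone that produces `point_feats` stays frozen.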
